Abstract. We propose robust methods for inference on the effect of a treatment variable on
a scalar outcome in the presence of very many controls. Our setting is a partially linear model 
with possibly non-Gaussian and heteroscedastic disturbances where the number of controls 
may be much larger than the sample size. To make informative inference feasible, we require 
the model to be approximately sparse; that is, we require that the effect of confounding factors 
can be controlled for up to a small approximation error by conditioning on a relatively small 
number of controls whose identities are unknown. The latter condition makes it possible to 
estimate the treatment effect by selecting approximately the right set of controls. We develop a 
novel estimation and uniformly valid inference method for the treatment effect in this setting, 
called the "post-double-selection" method. Our results apply to Lasso-type methods used for 
covariate selection as well as to any other model selection method that is able to find a sparse 
model with good approximation properties. 

The main attractive feature of our method is that it allows for imperfect selection of the 
controls and provides confidence intervals that are valid uniformly across a large class of mod- 
els. In contrast, standard post-model selection estimators fail to provide uniform inference 
even in simple cases with a small, fixed number of controls. Thus our method resolves the 
problem of uniform inference after model selection for a large, interesting class of models. We 
illustrate the use of the developed methods with numerical simulations and an application to 
the effect of abortion on crime rates.

1. Introduction



Many empirical analyses in economics focus on estimating the structural, causal, or treat- 
ment effect of some variable on an outcome of interest. For example, we might be interested 
in estimating the causal effect of some government policy on an economic outcome such as 
employment. Since economic policies and many other economic variables are not randomly 
assigned, economists rely on a variety of quasi-experimental approaches based on observational 
data when trying to estimate such effects. One important method is based on the assumption 
that the variable of interest can be taken as randomly assigned once a sufficient set of other 
factors has been controlled for. Economists, for example, might argue that changes in state- 
level public policies can be taken as randomly assigned relative to unobservable factors that 
could affect changes in state-level outcomes after controlling for aggregate macroeconomic ac- 
tivity, state-level economic activity, and state-level demographics; see, for example, Heckman, 
LaLonde, and Smith (1999) or Imbens (2004). 

A problem empirical researchers face when estimating a structural effect under an
identification strategy that rests on a conditional-on-observables argument is knowing which
controls to include. Typically, economic intuition will suggest a set of variables that might be
important but will not identify exactly which variables are important or the functional form 
with which variables should enter the model. This lack of clear guidance about what variables 
to use leaves researchers with the problem of selecting a set of controls from a potentially vast 
set of control variables including raw regressors available in the data as well as interactions 
and other transformations of these regressors. A typical economic study will rely on an ad hoc 
sensitivity analysis in which a researcher reports results for several different sets of controls 
in an attempt to show that the parameter of interest that summarizes the causal effect of the 
policy variable is insensitive to changes in the set of control variables. See Donohue III and 
Levitt (2001), which we use as the basis for the empirical study in this paper, or examples in 
Angrist and Pischke (2008) among many other references. 

We present an approach to estimating and performing inference on structural effects in 
an environment where the treatment variable may be taken as exogenous conditional on ob- 
servables that complements existing strategies. We pose the problem in the framework of a 
partially linear model

$y_i = d_i \alpha_0 + g(z_i) + \zeta_i$, (1.1)

where $d_i$ is the treatment/policy variable of interest, $z_i$ is a set of control variables, and $\zeta_i$
is an unobservable that satisfies $E[\zeta_i \mid d_i, z_i] = 0$.¹ The goal of the econometric analysis
is to conduct inference on the treatment effect $\alpha_0$. We examine the problem of selecting a
set of variables from among $p$ potential controls $x_i = P(z_i)$, which may consist of $z_i$ and
transformations of $z_i$, to adequately approximate $g(z_i)$ allowing for $p > n$. Of course, useful
inference about $\alpha_0$ is unavailable in this framework without imposing further structure on the
data. We impose such structure by assuming that exogeneity of $d_i$ may be taken as given once
one controls linearly for a relatively small number $s < n$ of variables in $x_i$ whose identities
are a priori unknown. This assumption implies that a linear combination of these $s$ unknown
controls provides an approximation to $g(z_i)$ which produces relatively small approximation
errors.² This assumption, which is termed approximate sparsity or simply sparsity, allows us
to approach the problem of estimating $\alpha_0$ as a variable selection problem. This framework
allows for the realistic scenario in which the researcher is unsure about exactly which variables
or transformations are important for approximating $g(z_i)$ and so must search among a broad
set of controls.

1 We note that $d_i$ does not need to be binary.

The assumed sparsity includes as special cases the most common approaches to parametric 
and nonparametric regression analysis. Sparsity justifies the use of fewer variables than there
are observations in the sample. When the initial number of variables is high, the assumption 
justifies the use of variable selection methods to reduce the number of variables to a manage- 
able size. In many economic applications, formal and informal strategies are often used to 
select such smaller sets of potential control variables. Most of these standard variable
selection strategies are non-robust and may produce poor inference.³ In an effort to demonstrate
robustness of their conclusions, researchers often employ ad hoc sensitivity analyses which 
examine the robustness of inferential conclusions to variations in the set of controls. Such 
sensitivity analyses are useful but lack rigorous justification. As a complement to these ad 
hoc approaches, we propose a formal, rigorous approach to inference allowing for selection of 
controls. Our proposal uses modern variable selection methods in a novel manner which results 
in robust and valid inference. 

The main contributions of this paper are providing a robust estimation and inference method 
within a partially linear model with potentially very high-dimensional controls and developing 
the supporting theory. The method relies on the use of Lasso-type or other sparsity-inducing 
procedures for variable selection. Our approach differs from usual post-model-selection meth- 
ods that rely on a single selection step. Rather, we use two different variable selection steps 
followed by a final estimation step as follows: 



2 We carefully define what we mean by small approximation errors in Section 2.
3 An example of inference going wrong is given in Figure 1 (left panel), presented in the next section, where
a standard post-model selection estimator has a bimodal distribution which sharply deviates from the standard 
normal distribution. More examples are given in Section 6 where we document the poor inferential performance 
of a standard post-model selection method.

1. In the first step, we select a set of control variables that are useful for predicting the
treatment $d_i$. This step helps to ensure robustness by finding control variables that are
strongly related to the treatment and thus potentially important confounding factors.

2. In the second step, we select additional variables by selecting control variables that
predict $y_i$. This step helps to ensure that we have captured important elements in
the equation of interest, ideally helping keep the residual variance small as well as
intuitively providing an additional chance to find important confounds.

3. In the final step, we estimate the treatment effect $\alpha_0$ of interest by the linear regression
of $y_i$ on the treatment $d_i$ and the union of the set of variables selected in the two
variable selection steps.

We provide theoretical results on the properties of the resulting treatment effect estimator 
and show that it provides inference that is uniformly valid over large classes of models and 
also achieves the semi-parametric efficiency bound under some conditions. Importantly, our 
theoretical results allow for imperfect variable selection in either of the two variable selection 
steps as well as allowing for non-Gaussianity and heteroscedasticity of the model's errors.⁴

We illustrate the theoretical results through an examination of the effect of abortion on 
crime rates following Donohue III and Levitt (2001). In this example, we find that the formal 
variable selection procedure produces a qualitatively different result than that obtained through 
the ad hoc set of sensitivity results presented in the original paper. By using formal variable 
selection, we select a small set of between eight and fourteen variables depending on the 
outcome, compared to the set of eight variables considered by Donohue III and Levitt (2001). 
Once this set of variables is linearly controlled for, the estimated abortion effect is rendered 
imprecise. It is interesting that the key variable selected by the variable selection procedure 
is the initial condition for the abortion rate. The selection of this initial condition and the 
resulting imprecision of the estimated treatment effect suggest that one cannot determine 
precisely whether the effect attributed to abortion found when this initial condition is omitted 
from the model is due to changes in the abortion rate or some other persistent state-level 
factor that is related to relevant changes in the abortion rate and current changes in the crime 
rate.⁵ It is interesting that Foote and Goetz (2008) raise a similar concern based on intuitive
grounds and additional data in a comment on Donohue III and Levitt (2001). Foote and 
Goetz (2008) find that including a linear trend interacted with crime rates from before abortion
could have had an effect renders the estimated abortion effects imprecise.⁶ Overall, finding that a formal,
rigorous approach to variable selection produces a qualitatively different result than a more ad
hoc approach suggests that these methods might be used to complement economic intuition in
selecting control variables for estimating treatment effects in settings where treatment is taken
as exogenous conditional on observables.

4 In a companion paper that presents an overview of results for $\ell_1$-penalized estimators, Belloni,
Chernozhukov, and Hansen (2011a), we provide similar results in the idealized Gaussian homoscedastic framework.

5 Note that all models are estimated in first-differences to eliminate any state-specific factors that might be
related to both the relevant level of the abortion rate and the level of the crime rate.

Relationship to literature. We contribute to several existing literatures. First, we con- 
tribute to the literature on series estimation of partially linear models (Donald and Newey 
(1994), Härdle, Liang, and Gao (2000), Robinson (1988), and others). We differ from most of
the existing literature which considers $p \ll n$ series terms by allowing $p \gg n$ series terms from
which we select $\hat{s} \ll n$ terms to construct the regression fits. Considering an initial broad set
of terms allows for more refined approximations of regression functions relative to the usual 
approach that uses only a few low-order terms. See, for example, Belloni, Chernozhukov, and 
Hansen (2011a) for a wage function example and Section 5 for theoretical examples. However, 
our most important contribution is to allow for data-dependent selection of the appropriate 
series terms. The previous literature on inference in the partially linear model generally takes 
the series terms as given without allowing for their data-driven selection. However, selection 
of series terms is crucial for achieving consistency when $p \gg n$ and is needed for increasing
efficiency even when $p = Cn$ with $C < 1$. That the standard estimator can be highly
inefficient in the latter case follows from results in Cattaneo, Jansson, and Newey (2010).⁷ We
focus on Lasso for performing this selection as a theoretically and computationally attractive 
device but note that any other method, such as selection using the traditional generalized 
cross-validation criteria, will work as long as the method guarantees sufficient sparsity in its 
solution. After model selection, one may apply conventional standard errors or the refined 
standard errors proposed by Cattaneo, Jansson, and Newey (2010).⁸



6 Donohue III and Levitt (2008) provide yet more data and a more complicated
specification in response to Foote and Goetz (2008). In a supplement available at
http://faculty.chicagobooth.edu/christian.hansen/research/, we provide additional results based on Donohue III
and Levitt (2008). The conclusions are similar to those obtained in this paper in that we find the estimated 
abortion effect becomes imprecise once one allows for a broad set of controls and selects among them. However, 
the specification of Donohue III and Levitt (2008) relies on a large number of district cross time fixed effects 
and so does not immediately fit into our regularity conditions. We conjecture the methodology continues to 
work in this case but leave verification to future research. 

7 Cattaneo, Jansson, and Newey (2010) derive properties of the series estimator under $p = Cn$, $C < 1$,
asymptotics. It follows from their results that under homoscedasticity the series estimator achieves the
semiparametric efficiency bound only if $C \to 0$.

8 If the selected number of terms $\hat{s}$ is a substantial fraction of $n$, we recommend using Cattaneo, Jansson, and
Newey (2010) standard errors after applying our model selection procedure.

Second, we contribute to the literature on the estimation of treatment effects. We note that
the policy variable $d_i$ does not have to be binary in our framework. However, our method has
a useful interpretation related to the propensity score when $d_i$ is binary. In the first selection
step, we select terms from $x_i$ that predict the treatment $d_i$, i.e. terms that explain the
propensity score. We also select terms from $x_i$ that predict $y_i$, i.e. terms that explain the outcome
regression function. Then we run a final regression of $y_i$ on the treatment $d_i$ and the union of
selected terms. Thus, our procedure relies on the selection of variables relevant for both the
propensity score and the outcome regression. Relying on selecting variables that are important
for both objects allows us to achieve two goals: we obtain uniformly valid confidence sets for
$\alpha_0$ despite imperfect model selection and we achieve full efficiency for estimating $\alpha_0$ in the
homoscedastic case. The relation of our approach to the propensity score brings about
interesting connections to the treatment effects literature. Hahn (1998), Heckman, Ichimura, and
Todd (1998), and Abadie and Imbens (2011) have constructed efficient regression or matching- 
based estimates of average treatment effects. Hahn (1998) also shows that conditioning on the 
propensity score is unnecessary for efficient estimation of average treatment effects. Hirano, 
Imbens, and Ridder (2003) demonstrate that one can efficiently estimate average treatment 
effects using estimated propensity score weighting alone. Robins and Rotnitzky (1995) have 
shown that using propensity score modeling coupled with a parametric regression model leads 
to efficient estimates if either the propensity score model or the parametric regression model is 
correct. While our contribution is quite distinct from these approaches, it also highlights the 
important robustness role played by the propensity score model in the selection of the right 
control terms for the final regression. 

Third, we contribute to the literature on estimation and inference with high-dimensional 
data and to the uniformity literature. There has been extensive work on estimation and 
perfect model selection in both low and high-dimensional contexts,⁹ but there has been little
work on inference after imperfect model selection. Perfect model selection relies on unrealistic
assumptions, and model selection mistakes can have serious consequences for inference as has
been shown in Pötscher (2009), Leeb and Pötscher (2008), and others. In work on instrument
selection for estimation of a linear instrumental variables model, Belloni, Chen, Chernozhukov, 
and Hansen (2010) have shown that model selection mistakes do not prevent valid inference 
about low-dimensional structural parameters due to the inherent adaptivity of the problem: 
Omission of a relevant instrument does not affect consistency of an IV estimator as long as 
there is another relevant instrument. The partially linear regression model does not have 
the same adaptivity structure, and model selection based on the outcome regression alone
produces non-robust confidence intervals.¹⁰ Our post-double-selection procedure creates the
necessary adaptivity by performing two separate model selection steps, making it possible
to perform robust/uniform inference after model selection. The uniformity holds over large,
interesting classes of high-dimensional sparse models. In that regard, our contribution is in the
spirit of, and builds upon, the classical contribution by Romano (2004) on the uniform validity of
t-tests for the univariate mean. It also shares the spirit of recent contributions, among others,
by Mikusheva (2007) on uniform inference in autoregressive models, by Andrews and Cheng
(2011) on uniform inference in moment condition models that are potentially unidentified, and
by Andrews, Cheng, and Guggenberger (2011) on a generic framework for uniformity analysis.

9 For reviews focused on econometric applications, see, e.g., Hansen (2005) and Belloni, Chernozhukov, and
Hansen (2010).

Finally, we contribute to the broader literature on high-dimensional estimation. For variable 
selection we use $\ell_1$-penalization methods, though our method and theory will allow for the use
of other methods. $\ell_1$-penalized methods have been proposed for model selection problems in
high-dimensional least squares problems, e.g. Lasso in Frank and Friedman (1993) and
Tibshirani (1996), in part because they are computationally efficient. Many $\ell_1$-penalized methods
have been shown to have good estimation properties even when perfect variable selection is
not feasible; see, e.g., Candès and Tao (2007), Meinshausen and Yu (2009), Bickel, Ritov, and
Tsybakov (2009), Huang, Horowitz, and Wei (2010), Belloni and Chernozhukov (2011b) and 
the references therein. Such methods have also been shown to extend suitably to nonpara- 
metric and non-Gaussian cases as in Bickel, Ritov, and Tsybakov (2009) and Belloni, Chen, 
Chernozhukov, and Hansen (2010). These methods also produce models with a relatively small 
set of variables. The last property is important in that it leaves the researcher with a set of 
variables that may be examined further; in addition it corresponds to the usual approach in 
economics that relies on considering a small number of controls. 

Paper Organization. In Section 2, we formally present the modeling environment includ- 
ing the key sparsity condition and develop our advocated estimation and inference method. 
We establish the consistency and asymptotic normality of our estimator of $\alpha_0$ uniformly over
large classes of models in Section 3. In Section 4, we present a generalization of the basic pro- 
cedure to allow for model selection methods other than Lasso. In Section 5, we present a series 
of theoretical examples in which we provide primitive conditions that imply the higher-level
conditions of Section 3. In Section 6, we present a series of numerical examples that verify our 
theoretical results numerically, and we apply our method to the abortion and crime example 
of Donohue III and Levitt (2001) in Section 7. In appendices, we provide the proofs. 



10 The poor performance of inference on a treatment effect after model selection on only the outcome equation
is shown through simulations in Section 6.

Notation. In what follows, we work with triangular array data $\{(\omega_{i,n}, i = 1, \ldots, n), n = 1, 2, 3, \ldots\}$
defined on probability space $(\Omega, \mathcal{A}, P_n)$, where $P = P_n$ can change with $n$. Each
$\omega_{i,n} = (y_{i,n}', z_{i,n}', d_{i,n}')'$ is a vector with components defined below, and these vectors are i.n.i.d.,
that is, independent across $i$ but not necessarily identically distributed. Thus, all parameters that
characterize the distribution of $\{\omega_{i,n}, i = 1, \ldots, n\}$ are implicitly indexed by $P_n$ and thus by
n. We omit the dependence on these objects from the notation in what follows for notational 
simplicity. We use array asymptotics to better capture some finite-sample phenomena and 
to ensure the robustness of conclusions with respect to perturbations of the data-generating
process P along various sequences. This robustness, in turn, translates into uniform validity 
of confidence regions over certain regions of data-generating processes. 

We use the following empirical process notation: $E_n[f] := E_n[f(\omega_i)] := \sum_{i=1}^n f(\omega_i)/n$, and
$\mathbb{G}_n(f) := \sum_{i=1}^n (f(\omega_i) - E[f(\omega_i)])/\sqrt{n}$. Since we want to deal with i.n.i.d. data, we also
introduce the average expectation operator: $\bar{E}[f] := E E_n[f] = E E_n[f(\omega_i)] = \sum_{i=1}^n E[f(\omega_i)]/n$.
The $\ell_2$-norm is denoted by $\|\cdot\|$, and the $\ell_0$-norm, $\|\cdot\|_0$, denotes the number of non-zero
components of a vector. We use $\|\cdot\|_\infty$ to denote the maximal absolute element of a vector. Given a
vector $\delta \in \mathbb{R}^p$ and a set of indices $T \subset \{1, \ldots, p\}$, we denote by $\delta_T \in \mathbb{R}^p$ the vector in which
$\delta_{Tj} = \delta_j$ if $j \in T$ and $\delta_{Tj} = 0$ if $j \notin T$. We use the notation $(a)_+ = \max\{a, 0\}$, $a \vee b = \max\{a, b\}$,
and $a \wedge b = \min\{a, b\}$. We also use the notation $a \lesssim b$ to denote $a \leq cb$ for some constant
$c > 0$ that does not depend on $n$; and $a \lesssim_P b$ to denote $a = O_P(b)$. For an event $E$, we say
that $E$ wp $\to 1$ when $E$ occurs with probability approaching one as $n$ grows. Given a $p$-vector
$b$, we denote $\mathrm{support}(b) = \{j \in \{1, \ldots, p\} : b_j \neq 0\}$.
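
As an informal illustration of the notation above, the following Python sketch computes the empirical average, the three norms, the restricted vector $\delta_T$, and the support of a vector. The function names are hypothetical and chosen only for this example, and indices are zero-based here whereas the text uses $\{1, \ldots, p\}$.

```python
import math

def empirical_mean(values):
    """E_n[f]: sample average of f(omega_i) over i = 1, ..., n."""
    return sum(values) / len(values)

def l2_norm(delta):
    """Euclidean norm of a vector."""
    return math.sqrt(sum(x * x for x in delta))

def l0_norm(delta):
    """Number of non-zero components of a vector."""
    return sum(1 for x in delta if x != 0)

def sup_norm(delta):
    """Largest absolute component of a vector."""
    return max(abs(x) for x in delta)

def restrict(delta, index_set):
    """delta_T: keep components whose index is in T, set the others to zero."""
    return [x if j in index_set else 0.0 for j, x in enumerate(delta)]

def support(b):
    """Set of (zero-based) indices j with b_j != 0."""
    return {j for j, x in enumerate(b) if x != 0}

delta = [0.0, -3.0, 0.0, 4.0]
print(l0_norm(delta), l2_norm(delta), sup_norm(delta))
print(restrict(delta, {0, 1}), support(delta))
```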



2. Inference on Treatment and Structural Effects Conditional on Observables

2.1. Framework. We consider the partially linear model 

$y_i = d_i \alpha_0 + g(z_i) + \zeta_i$, $E[\zeta_i \mid z_i, d_i] = 0$, (2.2)
$d_i = m(z_i) + v_i$, $E[v_i \mid z_i] = 0$, (2.3)

where $y_i$ is the outcome variable, $d_i$ is the policy/treatment variable whose impact $\alpha_0$ we would
like to infer, $z_i$ represents confounding factors on which we need to condition, and $\zeta_i$ and $v_i$ are
disturbances. The parameter $\alpha_0$ is the average treatment or structural effect under appropriate
conditions given, for example, in Heckman, LaLonde, and Smith (1999) or Imbens (2004) and 
is of major interest in many empirical studies.

The confounding factors $z_i$ affect the policy variable via the function $m(z_i)$ and the outcome
variable via the function $g(z_i)$. Both of these functions are unknown and potentially
complicated. We use linear combinations of control terms $x_i = P(z_i)$ to approximate $g(z_i)$ and $m(z_i)$,
writing (2.2) and (2.3) as

$y_i = d_i \alpha_0 + \underbrace{x_i'\beta_{g0} + r_{gi}}_{g(z_i)} + \zeta_i$, (2.4)
$d_i = \underbrace{x_i'\beta_{m0} + r_{mi}}_{m(z_i)} + v_i$, (2.5)

where $x_i'\beta_{g0}$ and $x_i'\beta_{m0}$ are approximations to $g(z_i)$ and $m(z_i)$, and $r_{gi}$ and $r_{mi}$ are the
corresponding approximation errors. In order to allow for a flexible specification and incorporation
of pertinent confounding factors, the vector of controls, $x_i = P(z_i)$, can have a dimension
$p = p_n$ which can be large relative to the sample size. Specifically, our results only require
$\log p = o(n^{1/3})$ along with other technical conditions. High-dimensional regressors $x_i = P(z_i)$
could arise for different reasons. For instance, the list of available controls could be large, i.e.
$x_i = z_i$ as in e.g. Koenker (1988). It could also be that many technical controls are present;
i.e. the list $x_i = P(z_i)$ could be composed of a large number of transformations of elementary
regressors $z_i$ such as B-splines, dummies, polynomials, and various interactions as in Newey
(1997) or Chen (2007). 
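
To make the idea of technical controls concrete, here is a minimal Python sketch that expands a few elementary regressors $z_i$ into a dictionary $x_i = P(z_i)$ of polynomial terms and pairwise interactions. The helper name `build_dictionary` and the choice of powers up to degree 3 are assumptions made for illustration; the text does not prescribe a particular dictionary.

```python
from itertools import combinations

def build_dictionary(z, max_degree=3):
    """Map elementary regressors z (a list of floats) to a longer list of
    controls: powers z_j^k for k = 1..max_degree, then pairwise products."""
    terms = []
    for value in z:
        for k in range(1, max_degree + 1):
            terms.append(value ** k)
    for a, b in combinations(z, 2):
        terms.append(a * b)
    return terms

z_i = [2.0, 3.0, 0.5]
x_i = build_dictionary(z_i)
print(len(z_i), "elementary regressors ->", len(x_i), "controls")
```

Even in this small case the number of controls is four times the number of elementary regressors; with more regressors or richer transformations, $p$ can quickly exceed $n$.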

Having very many controls creates a challenge for estimation and inference. A key condition 
that makes it possible to perform constructive estimation and inference in such cases is termed 
sparsity. Sparsity is the condition that there exist approximations $x_i'\beta_{g0}$ and $x_i'\beta_{m0}$ to $g(z_i)$
and $m(z_i)$ in (2.4)-(2.5) that require only a small number of non-zero coefficients to render the
approximation errors $r_{gi}$ and $r_{mi}$ sufficiently small relative to estimation error. More formally,
sparsity relies on two conditions. First, there exist $\beta_{g0}$ and $\beta_{m0}$ such that at most $s = s_n \ll n$
elements of $\beta_{m0}$ and $\beta_{g0}$ are non-zero so that

$\|\beta_{m0}\|_0 \leq s$ and $\|\beta_{g0}\|_0 \leq s$.

Second, the sparsity condition requires the size of the resulting approximation errors to be
small compared to the conjectured size of the estimation error:

$\{\bar{E}[r_{gi}^2]\}^{1/2} \lesssim \sqrt{s/n}$ and $\{\bar{E}[r_{mi}^2]\}^{1/2} \lesssim \sqrt{s/n}$.

Note that the size of the approximating model $s = s_n$ can grow with $n$ just as in standard
series estimation. 
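
As a hedged numerical illustration of the second condition, suppose (purely as an assumption for this example) that the true coefficients decay as $\beta_j = 1/j^2$ and that the controls are uncorrelated with unit variance, so that keeping the $s$ largest coefficients leaves a root-mean-square approximation error of $(\sum_{j > s} \beta_j^2)^{1/2}$. The sketch compares this error with $\sqrt{s/n}$, ignoring the constant allowed by $\lesssim$.

```python
import math

def rms_approx_error(p, s):
    """Root-mean-square error from keeping only the s largest coefficients,
    assuming beta_j = 1/j^2 and uncorrelated unit-variance controls."""
    return math.sqrt(sum((1.0 / j ** 2) ** 2 for j in range(s + 1, p + 1)))

n, p = 100, 500
for s in (1, 2, 5, 10):
    err = rms_approx_error(p, s)
    target = math.sqrt(s / n)
    print(s, round(err, 4), round(target, 4), err <= target)
```

Under these assumptions a single term is not enough, while a handful of terms already brings the approximation error below the target rate, even though all $p$ coefficients are non-zero.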

The high-dimensional-sparse-model framework outlined above extends the standard frame- 
work in the treatment effect literature which assumes both that the identities of the relevant 
controls are known and that the number of such controls $s$ is much smaller than the sample
size. Instead, we assume that there are many, $p$, potential controls of which at most $s$ controls
suffice to achieve a desirable approximation to the unknown functions $g(\cdot)$ and $m(\cdot)$ and allow
the identity of these controls to be unknown. Relying on this assumed sparsity, we use
selection methods to select approximately the right set of controls and then estimate the treatment
effect $\alpha_0$.

2.2. The Method: Least Squares after Double Selection. We propose the following 
method for estimating and performing inference about $\alpha_0$. The most important feature of this
method is that it does not rely on the highly unrealistic assumption of perfect model selection 
which is often invoked to justify inference after model selection. To the best of our knowledge, 
our result is the first of its kind in this setting. This result extends our previous results on 
inference under imperfect model selection in the instrumental variables model given in Belloni, 
Chen, Chernozhukov, and Hansen (2010). The problem is fundamentally more difficult in 
the present paper due to lack of adaptivity in estimation which we overcome by introducing 
additional model selection steps. The construction of our advocated procedure reflects our 
effort to offer a method that has attractive robustness/uniformity properties for inference. 
The estimator is $\sqrt{n}$-consistent and asymptotically normal under mild conditions and provides
confidence intervals that are robust to various perturbations of the data-generating process 
that preserve approximate sparsity. 

To define the method, we first write the reduced form corresponding to (2.2)-(2.3) as:

$y_i = x_i'\bar{\beta}_0 + \bar{r}_i + \bar{\zeta}_i$, (2.6)
$d_i = x_i'\beta_{m0} + r_{mi} + v_i$, (2.7)

where $\bar{\beta}_0 := \alpha_0 \beta_{m0} + \beta_{g0}$, $\bar{r}_i := \alpha_0 r_{mi} + r_{gi}$, $\bar{\zeta}_i := \alpha_0 v_i + \zeta_i$.

We have two equations and hence can apply model selection methods to each equation to
select control terms. The chief method we discuss is the Lasso method described in more
detail below. Given the set of selected controls from (2.6) and (2.7), we can estimate $\alpha_0$
by a least squares regression of $y_i$ on $d_i$ and the union of the selected controls. Inference
on $\alpha_0$ may then be performed using conventional methods for inference about parameters
estimated by least squares. Intuitively, this procedure works well since we are more likely
to recover key controls by considering selection of controls from both equations instead of
just considering selection of controls from the single equation (2.4) or (2.6). In finite-sample
experiments, single-selection methods essentially fail, providing poor inference relative to the
double-selection method outlined above. This performance is also supported theoretically by
the fact that the double-selection method requires weaker regularity conditions for its validity
and for attaining the efficiency bound¹¹ than the single selection method.
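
The reduced form (2.6) can be checked by substituting (2.5) into (2.4). The short derivation below is added for convenience and uses only the symbols already defined in (2.4)-(2.7).

```latex
\begin{align*}
y_i &= d_i \alpha_0 + x_i'\beta_{g0} + r_{gi} + \zeta_i \\
    &= (x_i'\beta_{m0} + r_{mi} + v_i)\,\alpha_0 + x_i'\beta_{g0} + r_{gi} + \zeta_i \\
    &= x_i'(\alpha_0 \beta_{m0} + \beta_{g0}) + (\alpha_0 r_{mi} + r_{gi}) + (\alpha_0 v_i + \zeta_i) \\
    &= x_i'\bar{\beta}_0 + \bar{r}_i + \bar{\zeta}_i .
\end{align*}
```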

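Now we formally define the post-double-selection estimator: Let $\hat{I}_1 = \mathrm{support}(\hat{\beta}_1)$ denote
the control terms selected by a feasible Lasso estimator $\hat{\beta}_1$ computed using data
$(\tilde{y}_i, \tilde{x}_i) = (d_i, x_i)$, $i = 1, \ldots, n$. Let $\hat{I}_2 = \mathrm{support}(\hat{\beta}_2)$ denote the control terms selected
by a feasible Lasso estimator $\hat{\beta}_2$ computed using data $(\tilde{y}_i, \tilde{x}_i) = (y_i, x_i)$, $i = 1, \ldots, n$. The
post-double-selection estimator $\check{\alpha}$ of $\alpha_0$ is defined as the least squares estimator obtained by
regressing $y_i$ on $d_i$ and the selected control terms $x_{ij}$ with $j \in \hat{I} \supseteq \hat{I}_1 \cup \hat{I}_2$:

$(\check{\alpha}, \check{\beta}) = \arg\min_{\alpha \in \mathbb{R}, \beta \in \mathbb{R}^p} \{E_n[(y_i - d_i \alpha - x_i'\beta)^2] : \beta_j = 0, \forall j \notin \hat{I}\}$. (2.8)

The set $\hat{I}$ may contain variables that were not selected in the variable selection steps with
indices in $\hat{I}_3$ that the analyst thinks are important for ensuring robustness. We call $\hat{I}_3$ the
amelioration set. Thus, $\hat{I} = \hat{I}_1 \cup \hat{I}_2 \cup \hat{I}_3$; let $\hat{s} = |\hat{I}|$ and $\hat{s}_j = |\hat{I}_j|$ for $j = 1, 2, 3$.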
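
The definition above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the feasible Lasso of the paper: the coordinate-descent solver `lasso_cd`, the helper `crude_penalty` (which replaces the unknown noise scale by the sample standard deviation of the dependent variable, a conservative choice), the omission of an intercept, and the simulated design are all hypothetical choices made for this example.

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=50):
    """Coordinate descent for E_n[(y - x'b)^2] + (lam/n)*||b||_1, no intercept."""
    n, p = X.shape
    b = np.zeros(p)
    col_ms = (X ** 2).mean(axis=0)            # E_n[x_j^2]
    resid = y.astype(float).copy()
    thr = lam / (2.0 * n)                     # soft-threshold level
    for _ in range(n_iter):
        for j in range(p):
            resid += X[:, j] * b[j]           # partial residual without x_j
            rho = X[:, j] @ resid / n         # E_n[x_j * partial residual]
            b[j] = np.sign(rho) * max(abs(rho) - thr, 0.0) / col_ms[j]
            resid -= X[:, j] * b[j]
    return b

def crude_penalty(y, n, p, c=1.1, gamma=0.05):
    """Assumed plug-in penalty: noise scale replaced by the sample s.d. of y."""
    return 2.0 * c * np.std(y) * np.sqrt(2.0 * n * np.log(2.0 * p / gamma))

def post_double_selection(y, d, X, extra=()):
    """Select controls predicting d, then controls predicting y, then run
    least squares of y on d and the union (plus an optional amelioration set)."""
    n, p = X.shape
    I1 = set(np.flatnonzero(lasso_cd(X, d, crude_penalty(d, n, p))).tolist())
    I2 = set(np.flatnonzero(lasso_cd(X, y, crude_penalty(y, n, p))).tolist())
    I = sorted(I1 | I2 | set(extra))
    W = np.column_stack([d, X[:, I]])
    coef, *_ = np.linalg.lstsq(W, y, rcond=None)
    return coef[0], I

# Simulated example (an assumption for illustration): two true confounders.
rng = np.random.default_rng(0)
n, p, alpha0 = 400, 100, 0.5
X = rng.standard_normal((n, p))
d = X[:, 0] + X[:, 1] + rng.standard_normal(n)
y = alpha0 * d + X[:, 0] + X[:, 1] + rng.standard_normal(n)
alpha_hat, selected = post_double_selection(y, d, X)
print(alpha_hat, selected)
```

The `extra` argument plays the role of the amelioration set: indices passed there enter the final regression whether or not either selection step picked them.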

We define feasible Lasso estimators below and note that other selection methods could be 
used as well. When a feasible Lasso is used to construct $\hat{I}_1$ and $\hat{I}_2$, we refer to the
post-double-selection estimator as the post-double-Lasso estimator. When other model selection devices
are used to construct $\hat{I}_1$ and $\hat{I}_2$, we shall refer to the estimator as the generic
post-double-selection estimator.

The main theoretical result of the paper shows that the post-double-selection estimator $\check{\alpha}$
obeys

$([\bar{E} v_i^2]^{-1} \bar{E}[v_i^2 \zeta_i^2] [\bar{E} v_i^2]^{-1})^{-1/2} \sqrt{n} (\check{\alpha} - \alpha_0) \rightsquigarrow N(0, 1)$ (2.9)

under approximate sparsity conditions, uniformly within a rich set of data generating
processes. We also show that the standard plug-in estimator for standard errors is consistent
in these settings. All of these results imply uniform validity of confidence regions over large,
interesting classes of models. Figure 1 (right panel) illustrates the result (2.9) by showing
that the finite-sample distribution of our post-double-selection estimator is very close to the
normal distribution. In contrast, Figure 1 (left panel) illustrates the classical problem with
the traditional post-single-selection estimator based on (2.4), showing that its distribution is
bimodal and sharply deviates from the normal distribution. Finally, it is worth noting that
the estimator achieves the semi-parametric efficiency bound under homoscedasticity. 
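
To show how (2.9) translates into a confidence interval, the sketch below replaces the population moments by sample averages of residuals. It assumes the residuals are already available (`v_hat` from regressing $d_i$ on the selected controls and `zeta_hat` from the final regression); the function name, the toy inputs, and the omission of any degrees-of-freedom correction are choices made for this illustration only.

```python
import math

def plug_in_interval(alpha_hat, v_hat, zeta_hat, z=1.959964):
    """Interval alpha_hat +/- z * sigma_hat / sqrt(n), where
    sigma_hat^2 = E_n[v^2 zeta^2] / (E_n[v^2])^2 mirrors the variance in (2.9).
    z = 1.959964 is the 97.5% standard normal quantile (95% coverage)."""
    n = len(v_hat)
    ev2 = sum(v * v for v in v_hat) / n
    ev2z2 = sum((v * e) ** 2 for v, e in zip(v_hat, zeta_hat)) / n
    sigma2 = ev2z2 / (ev2 * ev2)
    half = z * math.sqrt(sigma2 / n)
    return alpha_hat - half, alpha_hat + half

# Toy residuals chosen so the arithmetic is easy to follow by hand.
v_hat = [1.0, -1.0, 1.0, -1.0]
zeta_hat = [0.5, 0.5, -0.5, -0.5]
lo, hi = plug_in_interval(0.3, v_hat, zeta_hat)
print(lo, hi)
```

Because the variance uses the products $v_i^2 \zeta_i^2$ rather than a single error variance, the interval remains meaningful when the disturbances are heteroscedastic.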



11 Semi-parametric efficiency is attained in the homoscedastic case.

[Figure 1 panels: Distributions of Studentized Estimators; left: post-single-selection estimator; right: post-double-selection estimator.]

FIGURE 1. The finite-sample distributions (densities) of the standard post-single-selection
estimator (left panel) and of our proposed post-double-selection estimator (right panel). The
distributions are given for centered and studentized quantities. The results are based on 10000
replications of Design 1 described in Section 6, with $R^2$'s in equations (2.6) and (2.7) set to 0.5.

2.3. Selection of controls via feasible Lasso Methods. Here we describe feasible variable
selection via Lasso. Note that each of the regression equations above is of the form

$$y_i = f(z_i) + \epsilon_i = x_i'\beta_0 + r_i + \epsilon_i,$$

where $f(z_i)$ is the regression function, $x_i'\beta_0$ is the approximation based on the dictionary
$x_i = P(z_i)$, $r_i$ is the approximation error, and $\epsilon_i$ is the error. The Lasso estimator is defined
as a solution to

$$\min_{\beta \in \mathbb{R}^p} \mathbb{E}_n[(y_i - x_i'\beta)^2] + \frac{\lambda}{n}\|\beta\|_1, \tag{2.10}$$

where $\|\beta\|_1 = \sum_{j=1}^p |\beta_j|$; see Frank and Friedman (1993) and Tibshirani (1996). The kinked
nature of the penalty function induces the solution $\hat\beta$ to have many zeroes, and thus the Lasso
solution may be used for model selection. The selected model $\hat T = \mathrm{support}(\hat\beta)$ is often used for
further refitting by least squares, leading to the so-called post-Lasso or Gauss-Lasso estimator;
see, e.g., Belloni and Chernozhukov (2011b). The Lasso estimator/selector is computationally
attractive because it minimizes a convex function. In the homoskedastic Gaussian case, a basic
choice for penalty level suggested by Bickel, Ritov, and Tsybakov (2009) is

$$\lambda = 2\cdot c\,\sigma\sqrt{2n\log(2p/\gamma)}, \tag{2.11}$$

where $c > 1$, $1 - \gamma$ is a confidence level that needs to be set close to 1, and $\sigma$ is the standard
deviation of the noise. The formal motivation for this penalty is that it leads to near-optimal
rates of convergence of the estimator under approximate sparsity. The good behavior of the
estimator of $\beta_0$ in turn implies good approximation properties of the selected model $\hat T$, as
noted in Belloni and Chernozhukov (2011b). Unfortunately, even in the homoskedastic case
the penalty level specified above is not feasible since it depends on the unknown $\sigma$.
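To see concretely how the kink in the $\ell_1$ penalty produces exact zeros, it may help to look at the special case of an orthonormal design, $\mathbb{E}_n[x_i x_i'] = I_p$. This is an assumption made only for the illustration and is not used in the paper. In that case the objective in (2.10) separates across coordinates, and each coordinate of the solution is a soft-thresholding of the least squares coefficient $b_j = \mathbb{E}_n[x_{ij} y_i]$ at level $\lambda/(2n)$. A minimal Python sketch follows; the function names are hypothetical.

```python
def soft_threshold(b, t):
    """Shrink b toward zero by t; return exactly 0.0 when |b| <= t."""
    if b > t:
        return b - t
    if b < -t:
        return b + t
    return 0.0


def lasso_orthonormal(b, lam, n):
    """Lasso solution of (2.10) when E_n[x_i x_i'] is the identity.

    The orthonormal design is an illustrative assumption, not the paper's setting.
    b   : list of least squares coefficients b_j = E_n[x_ij * y_i]
    lam : penalty level lambda
    n   : sample size
    """
    t = lam / (2.0 * n)  # threshold implied by the penalty (lambda / n) * ||beta||_1
    return [soft_threshold(bj, t) for bj in b]


def support(beta):
    """0-based indices of the non-zero coefficients, i.e. the selected model."""
    return [j for j, bj in enumerate(beta) if bj != 0.0]
```

With `b = [1.0, 0.05, -0.5, -0.01]`, `lam = 20.0` and `n = 100`, the threshold is 0.1, so the two small coefficients are set exactly to zero and only the first and third regressors are selected.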

Belloni, Chen, Chernozhukov, and Hansen (2010) formulate a feasible Lasso estimator/selector
$\hat\beta$ geared for heteroscedastic, non-Gaussian cases, which solves

$$\min_{\beta \in \mathbb{R}^p} \mathbb{E}_n[(y_i - x_i'\beta)^2] + \frac{\lambda}{n}\|\hat\Psi\beta\|_1, \tag{2.12}$$

where $\hat\Psi = \mathrm{diag}(\hat l_1, \ldots, \hat l_p)$ is a diagonal matrix of penalty loadings. The penalty level $\lambda$ and
loadings $\hat l_j$'s are set as

$$\lambda = 2\cdot c\sqrt{n}\,\Phi^{-1}(1 - \gamma/(2p)) \quad\text{and}\quad \hat l_j = l_j + o_P(1),\ \ l_j = \sqrt{\mathbb{E}_n[x_{ij}^2\epsilon_i^2]},\ \ \text{uniformly in } j = 1,\ldots,p, \tag{2.13}$$

where $c > 1$ and $1 - \gamma$ is a confidence level. The $l_j$'s are ideal penalty loadings that are
not observed, and we estimate $l_j$ by $\hat l_j$ obtained via an iteration method given in Appendix
A. We refer to the resulting feasible Lasso method as the Iterated Lasso. The estimator $\hat\beta$
has statistical performance that is similar to that of the (infeasible) Lasso described above in
Gaussian cases and delivers similar performance in non-Gaussian, heteroscedastic cases; see
Belloni, Chen, Chernozhukov, and Hansen (2010). In this paper, we only use $\hat\beta$ as a model
selection device. Specifically, we only make use of

$$\hat T = \mathrm{support}(\hat\beta) = \{j \in \{1,\ldots,p\} : |\hat\beta_j| > 0\},$$

the labels of the regressors with non-zero estimated coefficients. We show that the selected
model $\hat T$ has good approximation properties for the regression function $f$ under approximate
sparsity in Section 3.

Belloni, Chernozhukov, and Wang (2011) propose another feasible variant of Lasso called
the Square-root Lasso estimator, $\hat\beta$, defined as a solution to

$$\min_{\beta \in \mathbb{R}^p} \sqrt{\mathbb{E}_n[(y_i - x_i'\beta)^2]} + \frac{\lambda}{n}\|\hat\Psi\beta\|_1, \tag{2.14}$$

with the penalty level

$$\lambda = c\cdot\sqrt{n}\,\Phi^{-1}(1 - \gamma/(2p)), \tag{2.15}$$

where $c > 1$, $\gamma \in (0,1)$ is a confidence level, and $\hat\Psi = \mathrm{diag}(\hat l_1, \ldots, \hat l_p)$ is a diagonal matrix of
penalty loadings. The main attractive feature of (2.14) is that in the homoscedastic case one can set
$\hat l_j = \{\mathbb{E}_n[x_{ij}^2]\}^{1/2}$, which depends only on observed data. Practical recommendations include the
choice $c = 1.1$ and $\gamma = .05$.
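The penalty levels discussed in this subsection are simple to evaluate numerically. The sketch below, in plain Python with hypothetical function names, computes the infeasible level (2.11) and the two feasible levels with the Gaussian quantile $\Phi^{-1}(1 - \gamma/(2p))$, using the recommended $c = 1.1$ and $\gamma = .05$ as defaults. It only computes $\lambda$; the penalty loadings and the estimators themselves are not implemented here.

```python
from math import log, sqrt
from statistics import NormalDist


def penalty_infeasible(n, p, sigma, c=1.1, gamma=0.05):
    """Penalty (2.11): 2 * c * sigma * sqrt(2 n log(2p / gamma)); requires the unknown sigma."""
    return 2.0 * c * sigma * sqrt(2.0 * n * log(2.0 * p / gamma))


def penalty_iterated_lasso(n, p, c=1.1, gamma=0.05):
    """Iterated Lasso penalty: 2 * c * sqrt(n) * Phi^{-1}(1 - gamma / (2p))."""
    return 2.0 * c * sqrt(n) * NormalDist().inv_cdf(1.0 - gamma / (2.0 * p))


def penalty_sqrt_lasso(n, p, c=1.1, gamma=0.05):
    """Square-root Lasso penalty (2.15): c * sqrt(n) * Phi^{-1}(1 - gamma / (2p))."""
    return c * sqrt(n) * NormalDist().inv_cdf(1.0 - gamma / (2.0 * p))
```

For example, with $n = 100$ and $p = 200$ the Square-root Lasso penalty is roughly 40, and the Iterated Lasso penalty is exactly twice that. Neither depends on $\sigma$, which is the point of the feasible variants.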

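In the heteroscedastic case, we would like to choose $\hat l_j$ so that

$$l_j + o_P(1) \le \hat l_j \lesssim_P l_j, \ \text{ where } l_j = \{\mathbb{E}_n[x_{ij}^2\epsilon_i^2]/\mathbb{E}_n[\epsilon_i^2]\}^{1/2}, \ \text{ uniformly in } j = 1,\ldots,p. \tag{2.16}$$

As a simple bound, we could use $\hat l_j = 2\{\mathbb{E}_n[x_{ij}^4]\}^{1/4}$ since, by the Cauchy-Schwarz inequality,

$$\{\mathbb{E}_n[x_{ij}^2\epsilon_i^2]/\mathbb{E}_n[\epsilon_i^2]\}^{1/2} \le \{\mathbb{E}_n[x_{ij}^4]\}^{1/4}\{\mathbb{E}_n[\epsilon_i^4]\}^{1/4}/\{\mathbb{E}_n[\epsilon_i^2]\}^{1/2}.$$

This bound gives $l_j + o_P(1) \le \hat l_j$ if $\{\mathbb{E}_n[\epsilon_i^4]\}^{1/4}/\{\mathbb{E}_n[\epsilon_i^2]\}^{1/2} \le 2 + o_P(1)$, which covers a wide
class of marginal distributions for the error $\epsilon_i$. For example, all t-distributions with degrees of
freedom greater than five satisfy this condition. As in the previous case, we can also iteratively
re-estimate the penalty loadings using estimates of the $\epsilon_i$'s to approximate the ideal penalty
loadings:

$$\hat l_j = l_j + o_P(1), \ \text{ uniformly in } j = 1,\ldots,p. \tag{2.17}$$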
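The moment condition behind the simple bound above can be checked at the population level for Student t errors. The sketch below is an illustration rather than part of the paper's procedure: it uses the standard moments of a t-distribution with $\nu > 4$ degrees of freedom, $E[\epsilon^2] = \nu/(\nu-2)$ and $E[\epsilon^4] = 3\nu^2/((\nu-2)(\nu-4))$, to evaluate the ratio $\{E[\epsilon^4]\}^{1/4}/\{E[\epsilon^2]\}^{1/2}$. The function name is hypothetical, and the sample version of the condition additionally requires the empirical moments to concentrate around these values.

```python
def t_moment_ratio(df):
    """Population value of {E[e^4]}^{1/4} / {E[e^2]}^{1/2} for a Student t error with df > 4."""
    if df <= 4:
        raise ValueError("the fourth moment of a t-distribution is finite only for df > 4")
    second = df / (df - 2.0)                              # E[e^2]
    fourth = 3.0 * df ** 2 / ((df - 2.0) * (df - 4.0))    # E[e^4]
    return fourth ** 0.25 / second ** 0.5
```

The ratio simplifies to $\{3(\nu-2)/(\nu-4)\}^{1/4}$, which equals $6^{1/4} \approx 1.57$ at $\nu = 6$ and decreases toward $3^{1/4}$ as $\nu$ grows, so it stays below 2 for all degrees of freedom above five.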

The resulting Square-root Lasso and post-Square-root Lasso estimators based on these penalty
loadings achieve near optimal rates of convergence even in non-Gaussian, heteroscedastic cases.
This good performance implies good approximation properties for the selected model $\hat T$.

In what follows, we shall use the term feasible Lasso to refer to either the Iterated Lasso
estimator $\hat\beta$ solving (2.12)-(2.13) or the Square-root Lasso estimator $\hat\beta$ solving (2.14)-(2.16)
with $c > 1$ and $1 - \gamma$ set such that

$$\gamma = o(1) \ \text{ and } \ \log(1/\gamma) \lesssim \log(p \vee n). \tag{2.18}$$

3. Theory of Estimation and Inference 

3.1. Regularity Conditions. In this section, we provide regularity conditions that are
sufficient for validity of the main estimation and inference result. We begin by stating our main
condition, which contains the previously defined approximate sparsity as well as other more
technical assumptions. Throughout the paper, we let $c$, $C$, and $q$ be absolute constants, and
let $\ell_n \nearrow \infty$, $\delta_n \searrow 0$, and $\Delta_n \searrow 0$ be sequences of absolute positive constants. By absolute
constants, we mean constants that are given, and do not depend on the dgp $P = P_n$.

We assume that for each $n$ the following condition holds on dgp $P = P_n$.

Condition ASTE (P). (i) $\{(y_i, d_i, z_i), i = 1,\ldots,n\}$ are i.n.i.d. vectors on $(\Omega, \mathcal{F}, P)$ that
obey the model (2.2)-(2.3), and the vector $x_i = P(z_i)$ is a dictionary of transformations of $z_i$,
which may depend on $n$ but not on $P$. (ii) The true parameter value $\alpha_0$, which may depend
on $P$, is bounded, $|\alpha_0| \le C$. (iii) Functions $m$ and $g$ admit an approximately sparse form.
Namely there exists $s \ge 1$ and $\beta_{m0}$ and $\beta_{g0}$, which depend on $n$ and $P$, such that

$$m(z_i) = x_i'\beta_{m0} + r_{mi}, \quad \|\beta_{m0}\|_0 \le s, \quad \{\mathrm{E}[r_{mi}^2]\}^{1/2} \le C\sqrt{s/n},$$
$$g(z_i) = x_i'\beta_{g0} + r_{gi}, \quad \|\beta_{g0}\|_0 \le s, \quad \{\mathrm{E}[r_{gi}^2]\}^{1/2} \le C\sqrt{s/n}.$$

(iv) The sparsity index obeys $s^2\log^2(p \vee n)/n \le \delta_n$ and the size of the amelioration set obeys
$\hat s_3 \le C(1 \vee \hat s_1 \vee \hat s_2)$. (v) For $\tilde v_i = v_i + r_{mi}$ and $\tilde\zeta_i = \zeta_i + r_{gi}$ we have
$|\mathrm{E}[\tilde v_i^2\tilde\zeta_i^2] - \mathrm{E}[v_i^2\zeta_i^2]| \le \delta_n$, and $\mathrm{E}[|\tilde v_i|^q + |\tilde\zeta_i|^q] \le C$ for some $q > 4$.
Moreover, $\max_{i \le n}\|x_i\|_\infty^2\, s\, n^{-1/2+2/q} \le \delta_n$ wp $1 - \Delta_n$.

Comment 3.1. The approximate sparsity (iii) and the growth condition (iv) are the main
conditions for establishing the key inferential result. We present a number of primitive examples
to show that these conditions contain standard models used in empirical research as well as
more flexible models. Condition (iv) requires that the size $\hat s_3$ of the amelioration set $\hat I_3$ should
not be substantially larger than the size of the set of variables selected by the Lasso method.
Simply put, if we decide to include controls in addition to those selected by Lasso, the total
number of additions should not dominate the number of controls selected by Lasso. This
and other conditions will ensure that the total number $\hat s$ of controls obeys $\hat s \lesssim_P s$, and we
also require that $s^2\log^2(p \vee n)/n \to 0$. This condition can be relaxed using the
sample-splitting method of Fan, Guo, and Hao (2011), which is done in the Supplementary Appendix.
Condition (v) is simply a set of sufficient conditions for consistent estimation of the variance of
the double selection estimator. If the regressors are uniformly bounded and the approximation
errors are going to zero a.s., it is implied by other conditions stated below; and it can also be
demonstrated under other sorts of more primitive conditions. □

The next condition concerns the behavior of the Gram matrix $\mathbb{E}_n[x_i x_i']$. Whenever $p > n$, the
empirical Gram matrix $\mathbb{E}_n[x_i x_i']$ does not have full rank and in principle is not well-behaved.
However, we only need good behavior of smaller submatrices. Define the minimal and maximal
$m$-sparse eigenvalue of a semi-definite matrix $M$ as

$$\phi_{\min}(m)[M] := \min_{1 \le \|\delta\|_0 \le m}\frac{\delta' M \delta}{\|\delta\|^2} \quad\text{and}\quad \phi_{\max}(m)[M] := \max_{1 \le \|\delta\|_0 \le m}\frac{\delta' M \delta}{\|\delta\|^2}.$$

To assume that $\phi_{\min}(m)[\mathbb{E}_n[x_i x_i']] > 0$ requires that all empirical Gram submatrices formed
by any $m$ components of $x_i$ are positive definite. We shall employ the following condition as a
sufficient condition for our results.

Condition SE (P). There is an absolute sequence of constants $\ell_n \to \infty$ such that the
maximal and minimal $\ell_n s$-sparse eigenvalues are bounded from above and away from zero,
namely with probability at least $1 - \Delta_n$,

$$\kappa' \le \phi_{\min}(\ell_n s)[\mathbb{E}_n[x_i x_i']] \le \phi_{\max}(\ell_n s)[\mathbb{E}_n[x_i x_i']] \le \kappa'',$$

where $0 < \kappa' < \kappa'' < \infty$ are absolute constants.

Comment 3.2. It is well-known that Condition SE is quite plausible for many designs of
interest. For instance, Condition SE holds if

(a) $x_i$, $i = 1,\ldots,n$, are i.i.d. zero-mean sub-Gaussian random vectors that have population
Gram matrix $\mathrm{E}[x_i x_i']$ with minimal and maximal $s\log n$-sparse eigenvalues bounded
away from zero and from above by absolute constants, where $s(\log n)(\log p)/n \le \delta_n \to 0$;

(b) $x_i$, $i = 1,\ldots,n$, are i.i.d. bounded zero-mean random vectors with $\|x_i\|_\infty \le K_n$
a.s. that have population Gram matrix $\mathrm{E}[x_i x_i']$ with minimal and maximal
$s\log n$-sparse eigenvalues bounded from above and away from zero by absolute constants,
where $K_n^2 s(\log^3 n)\{\log(p \vee n)\}/n \le \delta_n \to 0$.

The claim (a) holds by Theorem 3.2 in Rudelson and Zhou (2011) (see also Zhou (2009)
and Baraniuk, Davenport, DeVore, and Wakin (2008)) and claim (b) holds by Lemma 1 in
Belloni and Chernozhukov (2011b) or by Theorem 1.8 in Rudelson and Zhou (2011). Recall
that a standard assumption in econometric research is to assume that the population Gram
matrix $\mathrm{E}[x_i x_i']$ has eigenvalues bounded from above and away from zero; see, e.g., Newey (1997).
The conditions above allow for this and more general behavior, requiring only that the $s\log n$-sparse
eigenvalues of the population Gram matrix $\mathrm{E}[x_i x_i']$ are bounded from below and from
above. □

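The next condition imposes moment conditions on the structural errors and regressors.

Condition SM (P). There are absolute constants $0 < c < C < \infty$ and $4 < q < \infty$ such
that for $(\tilde y_i, \epsilon_i) = (y_i, \zeta_i)$ and $(\tilde y_i, \epsilon_i) = (d_i, v_i)$ the following conditions hold:

(i) $\mathrm{E}[|d_i|^q] \le C$, $c \le \mathrm{E}[\zeta_i^2 \mid x_i, v_i] \le C$ and $c \le \mathrm{E}[v_i^2 \mid x_i] \le C$ a.s., $1 \le i \le n$,

(ii) $\mathrm{E}[|\epsilon_i|^q] + \mathrm{E}[\tilde y_i^2] + \max_{1 \le j \le p}\{\mathrm{E}[x_{ij}^2\tilde y_i^2] + \mathrm{E}[|x_{ij}^3\epsilon_i^3|] + 1/\mathrm{E}[x_{ij}^2]\} \le C$,

(iii) $\log^3 p/n \le \delta_n$,

(iv) $\max_{1 \le j \le p}\{|(\mathbb{E}_n - \mathrm{E})[x_{ij}^2\epsilon_i^2]| + |(\mathbb{E}_n - \mathrm{E})[x_{ij}^2\tilde y_i^2]|\} + \max_{1 \le i \le n}\|x_i\|_\infty^2\, s\log(n \vee p)/n \le \delta_n$ wp $1 - \Delta_n$.

These conditions, which are rather mild, ensure good model selection performance of feasible
Lasso applied to equations (2.6) and (2.7). These conditions also allow us to invoke moderate
deviation theorems for self-normalized sums from Jing, Shao, and Wang (2003) to bound some
important error components.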
3.2. The Main Result. The following is the main result of this paper. It shows that the
post-double-selection estimator is root-n consistent and asymptotically normal. Under
homoscedasticity this estimator achieves the semi-parametric efficiency bound. The result also
verifies that plug-in estimates of the standard errors are consistent.

Theorem 1 (Estimation and Inference on Treatment Effects). Let $\{P_n\}$ be a sequence of
data-generating processes. Assume conditions ASTE (P), SM (P), and SE (P) hold for $P = P_n$ for
each $n$. Then, the post-double-Lasso estimator $\check\alpha$, constructed in the previous section, obeys as
$n \to \infty$

$$\sigma_n^{-1}\sqrt{n}(\check\alpha - \alpha_0) \rightsquigarrow N(0,1),$$

where $\sigma_n^2 = [\mathrm{E} v_i^2]^{-1}\mathrm{E}[v_i^2\zeta_i^2][\mathrm{E} v_i^2]^{-1}$. Moreover, the result continues to apply if $\sigma_n^2$ is replaced by
$\hat\sigma_n^2 = [\mathbb{E}_n\hat v_i^2]^{-1}\mathbb{E}_n[\hat v_i^2\hat\zeta_i^2][\mathbb{E}_n\hat v_i^2]^{-1}$, for $\hat\zeta_i := [y_i - d_i\check\alpha - x_i'\check\beta]\{n/(n - \hat s - 1)\}^{1/2}$ and $\hat v_i := d_i - x_i'\hat\beta$,
$i = 1,\ldots,n$, where $\hat\beta \in \arg\min_\beta\{\mathbb{E}_n[(d_i - x_i'\beta)^2] : \beta_j = 0, \forall j \notin \hat I\}$.

A consequence of this result is the following corollary.

Corollary 1 (Uniformly Valid Confidence Intervals). (i) Let $\mathbf{P}_n$ be the collection of all
data-generating processes $P$ for which conditions ASTE (P), SM (P), and SE (P) hold for given
$n$. Let $c(1 - \xi) = \Phi^{-1}(1 - \xi/2)$. Then as $n \to \infty$, uniformly in $P \in \mathbf{P}_n$

$$P(\alpha_0 \in [\check\alpha \pm c(1 - \xi)\hat\sigma_n/\sqrt{n}]) \to 1 - \xi.$$

(ii) Let $\mathbf{P} = \cap_{n \ge n_0}\mathbf{P}_n$ be the collection of data-generating processes for which the conditions
above hold for all $n \ge n_0$ for some $n_0$. Then as $n \to \infty$, uniformly in $P \in \mathbf{P}$

$$P(\alpha_0 \in [\check\alpha \pm c(1 - \xi)\hat\sigma_n/\sqrt{n}]) \to 1 - \xi.$$
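As an illustration of how the interval in the corollary is formed in practice, the following Python sketch computes the plug-in variance $[\mathbb{E}_n\hat v_i^2]^{-1}\mathbb{E}_n[\hat v_i^2\hat\zeta_i^2][\mathbb{E}_n\hat v_i^2]^{-1}$ from residual vectors and then the interval with critical value $\Phi^{-1}(1 - \xi/2)$. The function names are hypothetical, and the residuals are assumed to have been computed already from the post-double-selection fit (including the degrees-of-freedom rescaling of the outcome residuals); the selection and refitting steps are not shown.

```python
from math import sqrt
from statistics import NormalDist


def plug_in_variance(v_hat, zeta_hat):
    """Sandwich variance [E_n v^2]^{-1} E_n[v^2 zeta^2] [E_n v^2]^{-1} from residual lists."""
    n = len(v_hat)
    ev2 = sum(v * v for v in v_hat) / n
    ev2z2 = sum((v * z) ** 2 for v, z in zip(v_hat, zeta_hat)) / n
    return ev2z2 / ev2 ** 2


def confidence_interval(alpha_check, sigma_hat, n, xi=0.05):
    """Interval alpha_check +/- Phi^{-1}(1 - xi/2) * sigma_hat / sqrt(n)."""
    crit = NormalDist().inv_cdf(1.0 - xi / 2.0)
    half_width = crit * sigma_hat / sqrt(n)
    return alpha_check - half_width, alpha_check + half_width
```

A typical call would be `confidence_interval(alpha_check, sqrt(plug_in_variance(v_hat, zeta_hat)), n)`, where `alpha_check` is the estimated treatment effect.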

By exploiting both equations (2.4) and (2.5) for model selection, the post-double-selection
method creates the necessary adaptivity that makes it robust to imperfect model selection.
Robustness of the post-double-selection method is reflected in the fact that Theorem 1 permits
the data-generating process to change with $n$. Thus, the conclusions of the theorem are valid
for a wide variety of sequences of data-generating processes, which in turn define the regions $\mathbf{P}$
of uniform validity of the resulting confidence sets. These regions appear to be substantial, as
we demonstrate via a sequence of theoretical and numerical examples in Sections 5 and 6. In
contrast, the standard post-selection method based on (2.4) generates non-robust confidence
intervals.

Comment 3.3. Our approach to uniformity analysis is most similar to that of Romano (2004),
Theorem 4. It proceeds under triangular array asymptotics, with the sequence of dgps obeying
certain constraints; then these results imply uniformity over sets of dgps that obey the
constraints for all sample sizes. This approach is also similar to the classical central limit theorems
for sample means under triangular arrays, and does not require the dgps to be parametrically
(or otherwise tightly) specified, which then translates into uniformity of confidence regions.
This approach is somewhat different in spirit to the generic uniformity analysis suggested by
Andrews, Cheng, and Guggenberger (2011). □

Comment 3.4. Uniformity holds over a large class of approximately sparse models, which
cover conventional models used in series estimation of partially linear models as shown in
Section 5. Of course, for every interesting class of models and any inference method, one
could find an even bigger class of models where the uniformity does not apply. In particular,
our models do not cover models with many small coefficients. In the series case, a model
with many small coefficients corresponds to a deviation from smoothness towards highly
non-smooth functions, namely functions generated as realized paths of an approximate white noise
process. The fact that our results do not cover such models motivates further research work
on inference procedures that have robustness properties to deviations from the given class of
models that are deemed important. In the simulations in Section 6, we consider incorporating
the ridge fit along with the other controls to be selected over using Lasso, to build extra robustness
against "many small coefficients" deviations away from approximately sparse models. □

3.3. Auxiliary Results on Model Selection via Lasso and Post-Lasso. The
post-double-selection estimator applies the least squares estimator to the union of variables selected
for equations (2.6) and (2.7) via feasible Lasso. Therefore, the model selection properties of
feasible Lasso as well as properties of least squares estimates for $m$ and $g$ based on the selected
model play an important role in the derivation of the main result. The purpose of this section
is to describe these properties. The proof of Theorem 1 relies on these properties.

Note that each of the regression models (2.6)-(2.7) obeys the following conditions.

Condition ASM. Let $\{P_n\}$ be a sequence of data-generating processes. For each $n$, we
have data $\{(y_i, z_i, x_i = P(z_i)) : 1 \le i \le n\}$ defined on $(\Omega, \mathcal{A}, P_n)$ consisting of i.n.i.d. vectors
that obey the following approximately sparse regression model for each $n$:

$$y_i = f(z_i) + \epsilon_i = x_i'\beta_0 + r_i + \epsilon_i,$$
$$\mathrm{E}[\epsilon_i \mid x_i] = 0, \quad \mathrm{E}[\epsilon_i^2] = \sigma^2,$$
$$\|\beta_0\|_0 \le s, \quad \mathrm{E}[r_i^2] \lesssim \sigma^2 s/n.$$

Let $\hat T$ denote the model selected by the feasible Lasso estimator $\hat\beta$:

$$\hat T = \mathrm{support}(\hat\beta) = \{j \in \{1,\ldots,p\} : |\hat\beta_j| > 0\}.$$

The Post-Lasso estimator $\tilde\beta$ is ordinary least squares applied to the data after removing the
regressors that were not selected by the feasible Lasso:

$$\tilde\beta \in \arg\min_{\beta \in \mathbb{R}^p} \mathbb{E}_n[(y_i - x_i'\beta)^2] : \beta_j = 0 \text{ for each } j \notin \hat T. \tag{3.22}$$

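The following regularity conditions are imposed to deal with non-Gaussian, heteroscedastic
errors.

Condition RF. In addition to ASTE, we have

(i) $\log^3 p/n \to 0$ and $s\log(p \vee n)/n \to 0$,

(ii) $\mathrm{E}[y_i^2] + \max_{j \le p}\{\mathrm{E}[x_{ij}^2 y_i^2] + \mathrm{E}[|x_{ij}^3\epsilon_i^3|] + 1/\mathrm{E}[x_{ij}^2\epsilon_i^2]\} \lesssim 1$,

(iii) $\max_{j \le p}\{|(\mathbb{E}_n - \mathrm{E})[x_{ij}^2\epsilon_i^2]| + |(\mathbb{E}_n - \mathrm{E})[x_{ij}^2 y_i^2]|\} + \max_{i \le n}\|x_i\|_\infty^2\, s\log(n \vee p)/n = o_P(1)$.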

The main auxiliary result that we use in proving the main result is as follows.

Lemma 1 (Model Selection Properties of Lasso and Properties of Post-Lasso). Let $\{P_n\}$ be
a sequence of data-generating processes. Suppose that conditions ASM and RF hold, and that
Condition SE ($P_n$) holds for $\mathbb{E}_n[x_i x_i']$. Consider a feasible Lasso estimator with penalty level
and loadings specified as in Section 2.3.

(i) Then the data-dependent model $\hat T$ selected by a feasible Lasso estimator satisfies with
probability approaching 1:

$$\hat s = |\hat T| \lesssim s \tag{3.23}$$

and

$$\min_{\beta \in \mathbb{R}^p:\ \beta_j = 0\ \forall j \notin \hat T}\sqrt{\mathbb{E}_n[f(z_i) - x_i'\beta]^2} \lesssim \sigma\sqrt{\frac{s\log(p \vee n)}{n}}. \tag{3.24}$$

(ii) The Post-Lasso estimator obeys

$$\sqrt{\mathbb{E}_n[f(z_i) - x_i'\tilde\beta]^2} \lesssim_P \sigma\sqrt{\frac{s\log(p \vee n)}{n}}$$

and

$$\|\tilde\beta - \beta_0\| \lesssim_P \sqrt{\mathbb{E}_n[\{x_i'\tilde\beta - x_i'\beta_0\}^2]} \lesssim_P \sigma\sqrt{\frac{s\log(p \vee n)}{n}}. \tag{3.25}$$

Lemma 1 was derived in Belloni, Chen, Chernozhukov, and Hansen (2010) for Iterated Lasso
and by Belloni, Chernozhukov, and Wang (2010) for Square-root Lasso. These analyses build
on the rate analysis of infeasible Lasso by Bickel, Ritov, and Tsybakov (2009) and on sparsity
analysis and rate analysis of Post-Lasso by Belloni and Chernozhukov (2011b). Lemma 1 shows
that feasible Lasso methods select a model $\hat T$ that provides a high-quality approximation to
the regression function $f(z_i)$; i.e. they find a sparse model that can approximate the function
at the "near-oracle" rate $\sqrt{s/n}\sqrt{\log(p \vee n)}$. If we knew the "best" approximating model
$T = \mathrm{support}(\beta_0)$, we could achieve the "oracle" rate of $\sqrt{s/n}$. Note that Lasso methods
generally will not recover $T$ perfectly. Moreover, no method can recover $T$ perfectly in general,
except under the restrictive condition that all non-zero coefficients in $\beta_0$ are bounded away
from zero by a factor that exceeds estimation error. We do not require this condition to hold
in our results. All that we need is that the selected model $\hat T$ can approximate the regression
function well and that the size of the selected model, $\hat s = |\hat T|$, is of the same stochastic order
as $s = |T|$. This condition holds in many cases in which some non-zero coefficients are close to
zero.

The lemma above also shows that feasible Post-Lasso achieves the same near-oracle rate
as feasible Lasso. The coincidence in rates occurs despite the fact that feasible Lasso will in
general fail to correctly select the best-approximating model $T$ as a subset of the variables
selected; that is, $T \not\subseteq \hat T$. The intuition for this result is that any components of $T$ that feasible
Lasso misses are unlikely to be important; otherwise, (3.24) would be impossible. This result
was first derived in the context of median regression by Belloni and Chernozhukov (2011a) and
extended to least squares in the reference cited above.

4. Generalization: Inference after Double Selection by a Generic Selection Method

The conditions provided so far are simply a set of sufficient conditions that are tied to the use
of Lasso as the model selector. The purpose of this section is to prove that the main results
apply to any other model selection method that is able to select a sparse model with good
approximation properties. As in the case of Lasso, we allow for imperfect model selection.
Next we state a high-level condition that summarizes a sufficient condition on the performance
of a model selection method that allows the post-double-selection estimator to attain good
inferential properties.

Condition HLMS (P). A model selector provides possibly data-dependent sets $\hat I_1 \cup \hat I_2 \subseteq
\hat I \subset \{1,\ldots,p\}$ of covariate names such that, with probability $1 - \Delta_n$, $|\hat I| \le Cs$ and

$$\min_{\beta:\ \beta_j = 0,\ j \notin \hat I_1}\sqrt{\mathbb{E}_n[(m(z_i) - x_i'\beta)^2]} \le \delta_n n^{-1/4} \quad\text{and}\quad \min_{\beta:\ \beta_j = 0,\ j \notin \hat I_2}\sqrt{\mathbb{E}_n[(g(z_i) - x_i'\beta)^2]} \le \delta_n n^{-1/4}.$$

Condition HLMS requires that with high probability the selected models are sparse and
generate a good approximation for the functions $g$ and $m$. Examples of methods producing
such models include the Dantzig selector (Candes and Tao, 2007), feasible Dantzig selector
(Gautier and Tsybakov, 2011), Bridge estimator (Huang, Horowitz, and Ma, 2008), SCAD
penalized least squares (Fan and Li, 2001), and thresholded Lasso (Belloni and Chernozhukov,
2011b), to name a few. We emphasize that, similarly to the previous arguments, we allow for
imperfect model selection.

The following result establishes the inferential properties of a generic post-double-selection
estimator.

Theorem 2 (Estimation and Inference on Treatment Effects under High-Level Model
Selection). Let $\{P_n\}$ be a sequence of data-generating processes and the model selection device be
such that conditions ASTE (P), SM (P), SE (P), and HLMS (P) hold for $P = P_n$ for each $n$.
Then the generic post-double-selection estimator $\check\alpha$ based on $\hat I$, as defined in (2.8), obeys

$$([\mathrm{E} v_i^2]^{-1}\,\mathrm{E}[v_i^2\zeta_i^2]\,[\mathrm{E} v_i^2]^{-1})^{-1/2}\sqrt{n}(\check\alpha - \alpha_0) \rightsquigarrow N(0,1).$$

Moreover, the result continues to apply if $\mathrm{E}[v_i^2]$ and $\mathrm{E}[v_i^2\zeta_i^2]$ are replaced by $\mathbb{E}_n[\hat v_i^2]$ and $\mathbb{E}_n[\hat v_i^2\hat\zeta_i^2]$
for $\hat\zeta_i := [y_i - d_i\check\alpha - x_i'\check\beta]\{n/(n - \hat s - 1)\}^{1/2}$ and $\hat v_i := d_i - x_i'\hat\beta$, $i = 1,\ldots,n$, where
$\hat\beta \in \arg\min_\beta\{\mathbb{E}_n[(d_i - x_i'\beta)^2] : \beta_j = 0, \forall j \notin \hat I\}$.

Theorem 2 can also be used to establish uniformly valid confidence intervals as shown in the
following corollary.

Corollary 2 (Uniformly Valid Confidence Intervals). (i) Let $\mathbf{P}_n$ be the collection of all
data-generating processes $P$ for which conditions ASTE (P), SM (P), SE (P), and HLMS (P)
hold for given $n$. Let $c(1 - \xi) = \Phi^{-1}(1 - \xi/2)$. Then as $n \to \infty$, uniformly in $P \in \mathbf{P}_n$

$$P(\alpha_0 \in [\check\alpha \pm c(1 - \xi)\hat\sigma_n/\sqrt{n}]) \to 1 - \xi.$$

(ii) Let $\mathbf{P} = \cap_{n \ge n_0}\mathbf{P}_n$ be the collection of data-generating processes for which the conditions
above hold for all $n \ge n_0$ for some $n_0$. Then as $n \to \infty$, uniformly in $P \in \mathbf{P}$

$$P(\alpha_0 \in [\check\alpha \pm c(1 - \xi)\hat\sigma_n/\sqrt{n}]) \to 1 - \xi.$$

5. Theoretical Examples 

The purpose of this section is to give a sequence of examples - progressing from simple 
to somewhat involved - that highlight the range of the applicability and robustness of the 
proposed method. In these examples, we specify primitive conditions which cover a broad 
range of applications including nonparametric models and high-dimensional parametric models. 
We emphasize that our main regularity conditions cover even more general models which 
combine various features of these examples such as models with both nonparametric and high- 
dimensional parametric components. 
In all examples, the model is

$$y_i = d_i\alpha_0 + g(z_i) + \zeta_i, \quad \mathrm{E}[\zeta_i \mid z_i, v_i] = 0,$$
$$d_i = m(z_i) + v_i, \quad \mathrm{E}[v_i \mid z_i] = 0, \tag{5.26}$$

however, the structure for $g$ and $m$ will vary across examples, and so will the assumptions on
the error terms $\zeta_i$ and $v_i$.

We start out with a simple example, in which the dimension $p$ of the regressors is fixed.
In practical terms this example approximates cases with $p$ small compared to $n$. This
simple example is important since standard post-single-selection methods fail even in this simple
case. Specifically, they produce confidence intervals that are not valid uniformly in the
underlying data-generating process; see Leeb and Pötscher (2008). In contrast, the
post-double-selection method produces confidence intervals that are valid uniformly in the underlying
data-generating process.

Example 1. (Parametric Model with Fixed p.) Consider $(\Omega, \mathcal{A}, P)$ as the probability space,
on which we have $(y_i, z_i, d_i)$ as i.i.d. vectors for $i = 1,\ldots,n$ obeying the model (5.26) with

$$g(z_i) = \sum_{j=1}^p \beta_{g0,j}z_{ij}, \quad m(z_i) = \sum_{j=1}^p \beta_{m0,j}z_{ij}. \tag{5.27}$$

For estimation we use $x_i = (z_{ij}, j = 1,\ldots,p)'$. We assume that there are some absolute constants
$0 < b < B < \infty$, $q_x \ge q > 4$, with $4/q_x + 4/q < 1$, such that

$$\mathrm{E}[\|x_i\|^{q_x}] \le B, \quad \|\alpha_0\| + \|\beta_{g0}\| + \|\beta_{m0}\| \le B, \quad b \le \lambda_{\min}(\mathrm{E}[x_i x_i']),$$
$$b \le \mathrm{E}[\zeta_i^2 \mid x_i, v_i], \quad \mathrm{E}[|\zeta_i^q| \mid x_i, v_i] \le B, \quad b \le \mathrm{E}[v_i^2 \mid x_i], \quad \mathrm{E}[|v_i^q| \mid x_i] \le B. \tag{5.28}$$

Let $\mathbf{P}$ be the collection of all regression models $P$ that obey the conditions set forth above
for all $n$ for the given constants $(p, b, B, q_x, q)$. Then, as established in Appendix F, any
$P \in \mathbf{P}$ obeys Conditions ASTE (P) with $s = p$, SE (P), and SM (P) for all $n \ge n_0$, with the
constants $n_0$ and $(\kappa', \kappa'', c, C)$ and sequences $\Delta_n$ and $\delta_n$ in those conditions depending only on
$(p, b, B, q_x, q)$. Therefore, the conclusions of Theorem 1 hold for any sequence $P_n \in \mathbf{P}$, and the
conclusions of Corollary 1 on the uniform validity of confidence intervals apply uniformly in
$P \in \mathbf{P}$. □

The next examples are more substantial and include infinite-dimensional models which we
approximate with linear functional forms with potentially very many regressors, $p \gg n$. The
key to estimation in these models is a smoothness condition which requires regression
coefficients to decay at some rates. In series estimation, this condition is often directly connected
to smoothness of the regression function.
Let $a$ and $A$ be positive constants. We shall say that a sequence of coefficients
$\theta = \{\theta_j, j = 1,2,\ldots\}$ is $a$-smooth with constant $A$ if

$$|\theta_j| \le A j^{-a}, \quad j = 1,2,\ldots,$$

which will be denoted as $\theta \in S_A^a$. We shall say that a sequence of coefficients $\theta = \{\theta_j, j =
1,2,\ldots\}$ is $a$-smooth with constant $A$ after $p$-rearrangement if

$$|\theta_j| \le A j^{-a}, \quad j = p+1, p+2, \ldots, \qquad |\theta|_{(j)} \le A j^{-a}, \quad j = 1,\ldots,p,$$

which will be denoted as $\theta \in S_A^a(p)$, where $\{|\theta|_{(j)}, j = 1,\ldots,p\}$ denotes the decreasing
rearrangement of the numbers $\{|\theta_j|, j = 1,\ldots,p\}$. Since $S_A^a \subset S_A^a(p)$, the second kind of smoothness is
strictly more general than the first kind.

Here we use the term "smoothness" motivated by Fourier series analysis where smoothness of
functions often translates into smoothness of the Fourier coefficients in the sense that is stated
above; see, e.g., Kerkyacharian and Picard (1992). For example, if a function $h : [0,1]^d \to \mathbb{R}$
possesses $r > 0$ continuous derivatives uniformly bounded by a constant $M$ and the terms $P_j$ are
compactly supported Daubechies wavelets, then $h$ can be represented as $h(z) = \sum_{j=1}^\infty P_j(z)\theta_{hj}$,
with $|\theta_{hj}| \le A j^{-r/d-1/2}$ for some constant $A$; see Kerkyacharian and Picard (1992). We also
note that the second kind of smoothness is considerably more general than the first since it
allows relatively large coefficients to appear anywhere in the series of the first $p$ coefficients. In
contrast, the first kind of smoothness only allows relatively large coefficients among the early
terms in the series. Lasso-type methods are specifically designed to deal with the generalized
smoothness of the second kind and perform equally well under both kinds of smoothness. In
the context of series applications, smoothness of the second kind allows one to approximate
functions that exhibit oscillatory phenomena or spikes, which are associated with "high order"
series terms. An example of this is the wage function example given in Belloni, Chernozhukov,
and Hansen (2011a).

Before we proceed to other examples we discuss a way to generate sparse approximations
in infinite-dimensional examples. Consider, for example, a function $h$ that can be represented
as $h(z) = \sum_{j=1}^\infty \theta_{hj}P_j(z)$ with coefficients $\theta_h \in S_A^a(p)$. In this case we can construct
sparse approximations by simply thresholding to zero all coefficients smaller than $1/\sqrt{n}$ and
with indices $j \ge p$. This generates a sparsity index $s \le A^{1/a}n^{1/(2a)}$. The non-zero coefficients could
be further reoptimized by using the least squares projection. More formally, given a sparsity
index $s > 0$, a target function $h(z_i)$, and terms $x_i = (P_j(z_i) : j = 1,\ldots,p)' \in \mathbb{R}^p$, we let

$$\beta_{h0} := \arg\min_{\|\beta\|_0 \le s} \mathrm{E}[(h(z_i) - x_i'\beta)^2] \tag{5.29}$$

and define $x_i'\beta_{h0}$ as the best $s$-sparse approximation to $h(z_i)$.
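The thresholding argument above can be illustrated numerically. The sketch below, with hypothetical function names, counts how many coefficients survive the $1/\sqrt{n}$ threshold and compares the count with the bound $A^{1/a}n^{1/(2a)}$. The coefficient sequence $\theta_j = A j^{-a}$ used in the usage note is the extreme case allowed by $a$-smoothness and is chosen only for illustration. Because the count depends only on the magnitudes, any reordering of the first $p$ coefficients gives the same sparsity index, which is why smoothness after $p$-rearrangement suffices.

```python
from math import sqrt


def threshold_sparsity(theta, n):
    """Number of coefficients kept after setting to zero every |theta_j| < 1/sqrt(n)."""
    cutoff = 1.0 / sqrt(n)
    return sum(1 for t in theta if abs(t) >= cutoff)


def sparsity_bound(A, a, n):
    """Upper bound A^(1/a) * n^(1/(2a)) on the sparsity index under a-smoothness."""
    return A ** (1.0 / a) * n ** (1.0 / (2.0 * a))
```

For instance, with $A = 1$, $a = 2$, $n = 5000$ and $\theta_j = j^{-2}$ for $j = 1,\ldots,200$, eight coefficients survive the threshold, while the bound is $5000^{1/4} \approx 8.41$.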

Example 2. (Gaussian Model with Very Large p.) Consider $(\Omega, \mathcal{A}, P)$ as the probability
space on which we have $(y_i, z_i, d_i)$ as i.i.d. vectors for $i = 1,\ldots,n$ obeying the model (5.26)
with

$$m(z_i) = \sum_{j=1}^\infty \theta_{mj}z_{ij}, \quad g(z_i) = \sum_{j=1}^\infty \theta_{gj}z_{ij}. \tag{5.30}$$

Assume that the infinite dimensional vector $w_i = (z_i', \zeta_i, v_i)'$ is jointly Gaussian with minimal
and maximal eigenvalues of the matrix (operator) $\mathrm{E}[w_i w_i']$ bounded below by an absolute
constant $\underline\kappa > 0$ and above by an absolute constant $\bar\kappa < \infty$.

The main assumption that guarantees approximate sparsity is the smoothness condition on
the coefficients. Let $a > 1$ and $0 < A < \infty$ be some absolute constants. We require that the
coefficients of the expansions in (5.30) are $a$-smooth with constant $A$ after $p$-rearrangement,
namely

$$\theta_m = (\theta_{mj}, j = 1,2,\ldots) \in S_A^a(p), \quad \theta_g = (\theta_{gj}, j = 1,2,\ldots) \in S_A^a(p).$$

For estimation purposes we shall use $x_i = (z_{ij}, j = 1,\ldots,p)'$, and assume that $\|\alpha_0\| \le B$ and
$p = p_n$ obeys

$$n^{[(1-a)/a]+\chi}\log^2(p \vee n) \le \delta_n, \quad A^{1/a}n^{1/(2a)} \le p\,\delta_n, \quad\text{and}\quad \log^3 p/n \le \delta_n,$$

for some absolute sequence $\delta_n \searrow 0$, and absolute constants $B$ and $\chi > 0$.

Let $\mathbf{P}_n$ be the collection of all dgp $P$ that obey the conditions set forth in this example for
a given $n$ and for the given constants $(\underline\kappa, \bar\kappa, a, A, B, \chi)$ and sequences $p = p_n$ and $\delta_n$. Then, as
established in the Appendix, any $P \in \mathbf{P}_n$ obeys Conditions ASTE (P) with $s = A^{1/a}n^{1/(2a)}$, SE
(P), and SM (P) for all $n \ge n_0$, with constants $n_0$ and $(\kappa', \kappa'', c, C)$ and sequences $\Delta_n$ and $\delta_n$ in
those conditions depending only on $(\underline\kappa, \bar\kappa, a, A, B, \chi)$, $p$, and $\delta_n$. Therefore, the conclusions of
Theorem 1 hold for any sequence $P_n \in \mathbf{P}_n$, and the conclusions of Corollary 1 on the uniform
validity of confidence intervals apply uniformly for any $P \in \mathbf{P}_n$. In particular, these conclusions
apply uniformly in $P \in \mathbf{P} = \cap_{n \ge n_0}\mathbf{P}_n$. □

Example 3. (Series Model with Very Large p.) Consider $(\Omega, \mathcal{A}, P)$ as the probability space,
on which we have $(y_i, z_i, d_i)$ as i.i.d. vectors for $i = 1,\ldots,n$ obeying the model (5.26) with

$$m(z_i) = \sum_{j=1}^\infty \theta_{mj}P_j(z_i), \quad g(z_i) = \sum_{j=1}^\infty \theta_{gj}P_j(z_i), \tag{5.31}$$

where $z_i$ has support $[0,1]^d$ with density bounded from below by constant $\underline f > 0$ and above by
constant $\bar f$, and $\{P_j, j = 1,2,\ldots\}$ is an orthonormal basis on $L^2[0,1]^d$ with bounded elements,
i.e. $\max_{z \in [0,1]^d}|P_j(z)| \le B$ for all $j = 1,2,\ldots$. Here all constants are taken to be absolute.
Examples of such orthonormal bases include canonical trigonometric bases.

Let $a > 1$ and $0 < A < \infty$ be some absolute constants. We require that the coefficients of
the expansions in (5.31) are $a$-smooth with constant $A$ after $p$-rearrangement, namely

$$\theta_m = (\theta_{mj}, j = 1,2,\ldots) \in S_A^a(p), \quad \theta_g = (\theta_{gj}, j = 1,2,\ldots) \in S_A^a(p).$$

For estimation purposes we shall use $x_i = (P_j(z_i), j = 1,\ldots,p)'$, and assume that $p = p_n$
obeys

$$n^{(1-a)/a}\log^2(p \vee n) \le \delta_n, \quad A^{1/a}n^{1/(2a)} \le p\,\delta_n \quad\text{and}\quad \log^3 p/n \le \delta_n,$$

for some sequence of absolute constants $\delta_n \searrow 0$. We assume that there are some absolute
constants $b > 0$, $B < \infty$, $q > 4$, with $(1-a)/a + 4/q < 0$, such that

$$\|\alpha_0\| \le B, \quad b \le \mathrm{E}[\zeta_i^2 \mid x_i, v_i], \quad \mathrm{E}[|\zeta_i^q| \mid x_i, v_i] \le B, \quad b \le \mathrm{E}[v_i^2 \mid x_i], \quad \mathrm{E}[|v_i^q| \mid x_i] \le B. \tag{5.32}$$

Let $\mathbf{P}_n$ be the collection of all regression models $P$ that obey the conditions set forth above
for a given $n$. Then, as established in Appendix F, any $P \in \mathbf{P}_n$ obeys Conditions ASTE
(P) with $s = A^{1/a}n^{1/(2a)}$, SE (P), and SM (P) for all $n \ge n_0$, with absolute constants in those
conditions depending only on $(\underline f, \bar f, a, A, b, B, q)$ and $\delta_n$. Therefore, the conclusions of Theorem
1 hold for any sequence $P_n \in \mathbf{P}_n$, and the conclusions of Corollary 1 on the uniform validity
of confidence intervals apply uniformly for any $P \in \mathbf{P}_n$. In particular, as a special case, the
same conclusion applies uniformly in $P \in \mathbf{P} = \cap_{n \ge n_0}\mathbf{P}_n$. □

6. Monte-Carlo Examples 

In this section, we examine the finite-sample properties of the post-double-selection method
through a series of simulation exercises and compare its performance to that of the standard
post-single-selection method.

All of the simulation results are based on the structural model

$$ y_i = d_i \alpha_0 + x_i'\theta_g + \sigma_y(d_i, x_i) \zeta_i, \qquad \zeta_i \sim N(0, 1), \qquad (6.33) $$

where $p = \dim(x_i) = 200$, the covariates $x_i \sim N(0, \Sigma)$ with $\Sigma_{kj} = (0.5)^{|j-k|}$, $\alpha_0 = .5$, and the
sample size $n$ is set to 100. In each design, we generate

$$ d_i = x_i'\theta_m + \sigma_d(x_i) v_i, \qquad v_i \sim N(0, 1), \qquad (6.34) $$

with $\mathrm{E}[\zeta_i v_i] = 0$. Inference results for all designs are based on conventional t-tests with standard
errors calculated using the heteroscedasticity consistent jackknife variance estimator discussed 
in MacKinnon and White (1985). Another option would be to use the standard error estimator 
recently proposed in Cattaneo, Jansson, and Newey (2010). 

We report results from three different dgp's. In the first two dgp's, we set $\theta_{g,j} = c_y \beta_{0,j}$ and
$\theta_{m,j} = c_d \beta_{0,j}$ with $\beta_{0,j} = (1/j)^2$ for $j = 1, \ldots, 200$. The first dgp, which we label "Design 1," uses
homoscedastic innovations with $\sigma_y = \sigma_d = 1$. The second dgp, "Design 2," is heteroscedastic with

$$ \sigma_{d,i} = \sqrt{\frac{(1 + x_i'\beta_0)^2}{\mathbb{E}_n (1 + x_i'\beta_0)^2}} \qquad \text{and} \qquad \sigma_{y,i} = \sqrt{\frac{(1 + \alpha_0 d_i + x_i'\beta_0)^2}{\mathbb{E}_n (1 + \alpha_0 d_i + x_i'\beta_0)^2}}. $$

The constants $c_y$ and $c_d$ are chosen to generate desired population values for the reduced form
$R^2$'s, i.e. the $R^2$'s for equations (2.6) and (2.7). For each equation, we choose $c_y$ and $c_d$ to
generate $R^2 = 0$, .2, .4, .6, and .8. In the heteroscedastic design, we choose $c_y$ and $c_d$ based on
$R^2$ as if (6.33) and (6.34) held with $v_i$ and $\zeta_i$ homoscedastic and label the results by $R^2$ as in
Design 1. In the third design ("Design 3"), we use a combination of deterministic and random
coefficients. For the deterministic coefficients, we set $\theta_{g,j} = c_y (1/j)^2$ for $j \leq 5$ and
$\theta_{m,j} = c_d (1/j)^2$ for $j \leq 5$. We then generate the remaining coefficients as iid draws from
$(\theta_{g,j}, \theta_{m,j})' \sim N(0_{2 \times 1}, (1/p) I_2)$. For each equation, we choose $c_y$ and $c_d$ to generate
$R^2 = 0$, .2, .4, .6, and .8 in the case that all of the random coefficients were exactly equal to 0
and label the results by $R^2$ as in Design 1. We draw new $x$'s, $\zeta$'s, and $v$'s at every simulation
replication, and we also generate new $\theta$'s at every simulation replication in Design 3.
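
As a rough illustration of how one replication of the homoscedastic design might be drawn, the following Python sketch generates data from (6.33) and (6.34) with $\sigma_y = \sigma_d = 1$. The function name `simulate_design1` and the default values `c_y = c_d = 1` are assumptions made for illustration only; the calibration of these constants to target $R^2$ values described above is not reproduced here.

```python
import numpy as np

def simulate_design1(n=100, p=200, alpha0=0.5, c_y=1.0, c_d=1.0, seed=0):
    """Draw one sample from the homoscedastic design in (6.33)-(6.34).

    c_y and c_d are placeholder scale constants (assumed values); the text
    chooses them to hit target population R^2 values.
    """
    rng = np.random.default_rng(seed)
    idx = np.arange(p)
    # Toeplitz covariance: Sigma[k, j] = 0.5 ** |j - k|
    Sigma = 0.5 ** np.abs(idx[:, None] - idx[None, :])
    # x_i ~ N(0, Sigma) via the Cholesky factor of Sigma
    L = np.linalg.cholesky(Sigma)
    x = rng.standard_normal((n, p)) @ L.T
    beta0 = (1.0 / np.arange(1, p + 1)) ** 2    # beta_{0,j} = (1/j)^2
    theta_g = c_y * beta0
    theta_m = c_d * beta0
    v = rng.standard_normal(n)                  # first-stage disturbance
    zeta = rng.standard_normal(n)               # structural disturbance
    d = x @ theta_m + v                         # (6.34) with sigma_d = 1
    y = alpha0 * d + x @ theta_g + zeta         # (6.33) with sigma_y = 1
    return y, d, x, Sigma

y, d, x, Sigma = simulate_design1()
```

Drawing $v_i$ and $\zeta_i$ independently is one simple way to satisfy $\mathrm{E}[\zeta_i v_i] = 0$; a new seed would be used at each simulation replication.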

We consider Designs 1 and 2 to be baseline designs. These designs do not have exact 
sparse representations but have coefficients that decay quickly so that approximately sparse 
representations are available. Design 3 is meant to introduce a modest deviation from the 
approximately sparse model towards a model with many small, uncorrelated coefficients. Using
this design, we shall document that our proposed procedure still performs reasonably well,
although it could be improved by incorporation of a ridge fit as one of the regressors over which
selection occurs. In a working paper version of this paper, Belloni, Chernozhukov, and Hansen
(2011b), we present results for 26 additional designs. The results presented in this section are
sufficient to illustrate the general patterns from the larger set of results.$^{13}$

13 In particular, the post-double-Lasso performed very well across all simulation designs where
approximate sparsity provides a reasonable description of the dgp. Unsurprisingly, the performance
deteriorates as one deviates from the smooth/approximately sparse case. However, in no design was
the post-double-Lasso outperformed by other feasible procedures. In extensive initial simulations,
we also found that Square-Root Lasso and Iterated Lasso performed very similarly and thus only
report Lasso results.

We report results for five different procedures. Two of the procedures are infeasible
benchmarks: Oracle and Double-Selection Oracle estimators, which use knowledge of the true
coefficient structures $\theta_g$ and $\theta_m$ and are thus unavailable in practice. The Oracle estimator is
the ordinary least squares regression of $y_i - x_i'\theta_g$ on $d_i$, and the Double-Selection Oracle is the
ordinary least squares regression of $y_i - x_i'\theta_g$ on $d_i - x_i'\theta_m$. The other procedures we consider
are feasible. In all of them, we rely on Lasso and set $\lambda$ according to the algorithm outlined in
Appendix A with $1 - \gamma = .95$. One procedure is the standard post-single-selection estimator,
the Post-Lasso, which applies Lasso to equation (6.33) without penalizing $\alpha$, the coefficient on
$d$, to select additional control variables from among $x$. Estimates of $\alpha_0$ are then obtained by
OLS regression of $y$ on $d$ and the set of additional controls selected in the Lasso step, and
inference using the Post-Lasso estimator proceeds using conventional heteroscedasticity robust
OLS inference from this regression. Post-Double-Selection or Post-Double-Lasso is the feasible
procedure advocated in this paper. We run Lasso of $y$ on $x$ to select a set of predictors for $y$
and run Lasso of $d$ on $x$ to select a set of predictors for $d$. $\alpha_0$ is then estimated by running
OLS regression of $y$ on $d$ and the union of the sets of regressors selected in the two Lasso runs,
and inference is simply the usual heteroscedasticity robust OLS inference from this regression.
Post-Double-Selection + Ridge is an ad hoc variant of Post-Double-Selection in which we add
the ridge fit from equation (6.34) as an additional potential regressor that may be selected by
Lasso. The ridge fit is obtained with a single ridge penalty parameter that is chosen using
10-fold cross-validation. This procedure is motivated by a desire to add further robustness in
the case that many small coefficients are suspected. Further exploration of procedures that
perform well, both theoretically and in simulations, in the presence of many small coefficients
is an interesting avenue for additional research.
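
The Post-Double-Selection steps described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used for the reported results: the helper names `lasso_cd` and `post_double_selection` are hypothetical, the Lasso is a plain coordinate-descent solver without penalty loadings, the penalty levels are a simple rule of thumb rather than the data-driven choice from Appendix A, and the standard error is the basic heteroscedasticity-robust (HC0) form rather than the jackknife estimator used in the simulations.

```python
import numpy as np

def lasso_cd(X, y, lam, n_sweeps=100):
    """Coordinate descent for (1/(2n))*||y - X b||^2 + lam*||b||_1, no intercept."""
    n, p = X.shape
    b = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    r = y.astype(float).copy()              # running residual y - X b
    for _ in range(n_sweeps):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            r += X[:, j] * b[j]             # add back coordinate j
            rho = X[:, j] @ r / n
            # soft-thresholding update for coordinate j
            b[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
            r -= X[:, j] * b[j]
    return b

def post_double_selection(y, d, X, lam_y, lam_d):
    """Select controls for y and for d, then run OLS of y on d and the union."""
    sel_y = np.flatnonzero(lasso_cd(X, y, lam_y))    # predictors of y
    sel_d = np.flatnonzero(lasso_cd(X, d, lam_d))    # predictors of d
    union = np.union1d(sel_y, sel_d)
    W = np.column_stack([d, np.ones_like(d), X[:, union]])
    coef, *_ = np.linalg.lstsq(W, y, rcond=None)
    resid = y - W @ coef
    # HC0 sandwich variance (assumed simplification of the robust inference)
    bread = np.linalg.pinv(W.T @ W)
    meat = W.T @ (W * resid[:, None] ** 2)
    V = bread @ meat @ bread
    return coef[0], np.sqrt(V[0, 0]), union

# Toy data (illustrative values only): two controls matter in both equations.
rng = np.random.default_rng(0)
n, p, alpha0 = 200, 50, 0.5
X = rng.standard_normal((n, p))
theta = np.zeros(p)
theta[:2] = [1.0, 0.5]
d = X @ theta + rng.standard_normal(n)
y = alpha0 * d + X @ theta + rng.standard_normal(n)

rate = np.sqrt(2.0 * np.log(p) / n)          # rule-of-thumb penalty rate (assumption)
alpha_hat, se, selected = post_double_selection(
    y, d, X, lam_y=1.1 * y.std() * rate, lam_d=1.1 * d.std() * rate)
```

Taking the union of the two selected sets is the key design choice: a control that is missed in the outcome equation can still enter the final regression if it predicts the treatment.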

We start by summarizing results in Table 1 for $(R_y^2, R_d^2) = (0, .2)$, $(0, .8)$, $(.8, .2)$, and $(.8, .8)$,
where $R_y^2$ is the population $R^2$ from regressing $y$ on $x$ (Structure $R^2$) and $R_d^2$ is the population
$R^2$ from regressing $d$ on $x$ (First Stage $R^2$). We report root-mean-square-error (RMSE) for
estimating $\alpha_0$ and size of 5% level tests (Rej. Rate). As should be the case, the Oracle
and Double-Selection Oracle, which are reported to provide the performance of an infeasible
benchmark, perform well relative to the feasible procedures across the three designs. We
do see that the feasible Post-Double-Selection procedures perform similarly to the
Double-Selection Oracle without relying on ex ante knowledge of the coefficients that go into the
control functions, $\theta_g$ and $\theta_m$. On the other hand, the Post-Lasso procedure generally does
not perform as well as Post-Double-Selection and is very sensitive to the value of $R_d^2$. While
Post-Lasso performs adequately when $R_d^2$ is small, its performance deteriorates quickly as
$R_d^2$ increases. This lack of robustness of traditional variable selection methods such as Lasso,
which were designed with forecasting, not inference about treatment effects, in mind, is the
chief motivation for our advocating the Post-Double-Selection procedure when trying to infer
structural or treatment parameters.

We provide further details about the performance of the feasible estimators in Figures 1,
2, and 3, which plot size of 5% level tests, bias, and standard deviation for the Post-Lasso,
Double-Selection (DS), and Double-Selection Oracle (DS Oracle) estimators of the treatment
effect across the full set of $R^2$ values considered. Figures 1, 2, and 3 report the results from
Designs 1, 2, and 3, respectively. The figures are plotted with the same scale to aid comparability,
and for readability rejection frequencies for Post-Lasso were censored at .5. Perhaps the most
striking feature of the figures is the poor performance of the Post-Lasso estimator. The
Post-Lasso estimator performs poorly in terms of size of tests across many different $R^2$ combinations
and can have an order of magnitude more bias than the corresponding Post-Double-Selection
estimator. The behavior of Post-Lasso is quite non-uniform across $R^2$ combinations, and
Post-Lasso does not reliably control size distortions or bias except in the case where the controls
are uncorrelated with the treatment (where First-Stage $R^2$ equals 0) and thus ignorable. In
contrast, the Post-Double-Selection estimator performs relatively well across the full range
of $R^2$ combinations considered. The Post-Double-Selection estimator's performance is also
quite similar to that of the infeasible Double-Selection Oracle across the majority of $R^2$ values
considered. Comparing across Figures 1 and 2, we see that size distortions for both the
Post-Double-Selection estimator and the Double-Selection Oracle are somewhat larger in the
presence of heteroscedasticity but that the basic patterns are more-or-less the same across the
two figures. Looking at Figure 3, we also see that the addition of small independent random
coefficients results in somewhat larger size distortions for the Post-Double-Selection estimator
than in the other homoscedastic design, Design 1, though the procedure still performs relatively
well.

In the final figure, Figure 4, we compare the performance of the Post-Double-Selection
procedure to the ad hoc Post-Double-Selection procedure which selects among the original
set of variables augmented with the ridge fit obtained from equation (6.34). We see that the
addition of this variable does add robustness relative to Post-Double-Selection using only the
raw controls in the sense of producing tests that tend to have size closer to the nominal level.
This additional robustness is a good feature, though it comes at the cost of increased RMSE
which is especially prominent for small values of the first-stage $R^2$.

The simulation results are favorable to the Post-Double-Selection estimator. In the simula- 
tions, we see that the Post-Double-Selection procedure provides an estimator of a treatment 
effect in the presence of a large number of potential confounding variables that performs simi- 
larly to the infeasible estimator that knows the values of the coefficients on all of the confound- 
ing variables. Overall, the simulation evidence supports our theoretical results and suggests 
that the proposed Post-Double-Selection procedure can be a useful tool to researchers doing 
structural estimation in the presence of many potential confounding variables. It also shows,
by contrast, that the standard Post-Single-Selection procedure provides poor inference and
therefore cannot be a reliable tool for these researchers.

7. Empirical Example: Estimating the Effect of Abortion on Crime

In the preceding sections, we have provided results demonstrating how variable selection 
methods, focusing on the case of Lasso-based methods, can be used to estimate treatment ef- 
fects in models in which we believe the variable of interest is exogenous conditional on observ- 
ables. We further illustrate the use of these methods in this section by reexamining Donohue 
III and Levitt's (2001) study of the impact of abortion on crime rates. In the following, we 
briefly review Donohue III and Levitt (2001) and then present estimates obtained using the 
methods developed in this paper. 

Donohue III and Levitt (2001) discuss two key arguments for a causal channel relating 
abortion to crime. The first is simply that more abortion among a cohort results in an otherwise 
smaller cohort and so crime 15 to 25 years later, when this cohort is in the period when its 
members are most at risk for committing crimes, will be otherwise lower given the smaller 
cohort size. The second argument is that abortion gives women more control over the timing 
of their fertility allowing them to more easily assure that childbirth occurs at a time when a 
more favorable environment is available during a child's life. For example, access to abortion 
may make it easier to ensure that a child is born at a time when the family environment is 
stable, the mother is better educated, or household income is stable. This second channel
would mean that more access to abortion could lead to lower crime rates even if fertility rates 
remained constant. 

The basic problem in estimating the causal impact of abortion on crime is that state-level
abortion rates are not randomly assigned, and it seems likely that there will be factors that
are associated with both abortion rates and crime rates. It is clear that any association between
the current abortion rate and the current crime rate is likely to be spurious. However, even
if one looks at, say, the relationship between the abortion rate 18 years in the past and the
crime rate among current 18 year olds, the lack of random assignment makes establishing a
causal link difficult without adequate controls. An obvious confounding factor is the existence
of persistent state-to-state differences in policies, attitudes, and demographics that are likely
related to the overall state-level abortion and crime rates. It is also important to control
flexibly for aggregate trends. For example, it could be the case that national crime rates
were falling over this period while national abortion rates were rising but that these trends
were driven by completely different factors. Without controlling for these trends, one would
mistakenly attribute the reduction in crime to the increase in abortion. In addition to these
overall differences across states and times, there are other time-varying characteristics, such as
state-level income, policing, or drug use, to name a few, that could be associated with current
crime and past abortion.

To address these confounds, Donohue III and Levitt (2001) estimate a model for state-level
crime rates running from 1985 to 1997 in which they condition on a number of these factors.
Their basic specification is

$$ y_{cit} = \alpha a_{cit} + w_{it}'\beta + \delta_i + \gamma_t + \varepsilon_{it}, \qquad (7.35) $$

where $i$ indexes states, $t$ indexes times, $c \in \{\text{violent}, \text{property}, \text{murder}\}$ indexes type of crime,
$\delta_i$ are state-specific effects that control for any time-invariant state-specific characteristics,
$\gamma_t$ are time-specific effects that control flexibly for any aggregate trends, $w_{it}$ are a set of
control variables to control for time-varying confounding state-level factors, $a_{cit}$ is a measure
of the abortion rate relevant for type of crime $c$,$^{14}$ and $y_{cit}$ is the crime rate for crime type $c$.
Donohue III and Levitt (2001) use the log of lagged prisoners per capita, the log of lagged police
per capita, the unemployment rate, per-capita income, the poverty rate, AFDC generosity at
time $t - 15$, a dummy for concealed weapons law, and beer consumption per capita for $w_{it}$, the
set of time-varying state-specific controls. Tables IV and V in Donohue III and Levitt (2001)
present baseline estimation results based on (7.35) as well as results from different models
which vary the sample and set of controls to show that the baseline estimates are robust to
small deviations from (7.35). We refer the reader to the original paper for additional details,
data definitions, and institutional background.

For our analysis, we take as given the argument from Donohue III and Levitt (2001) that the
abortion rates defined above may be treated as exogenous relative to crime rates once observables
have been conditioned on. Given the seemingly obvious importance of controlling for state
and time effects, we account for these in all models we estimate. We choose to eliminate the
state effects via differencing rather than including a full set of state dummies but include a full
set of time dummies in every model. Thus, we will estimate models of the form

(7.36)

We use the same state-level data as Donohue III and Levitt (2001) but delete Alaska, Hawaii,
and Washington, D.C., which gives a sample with 48 cross-sectional observations and 12 time
series observations for a total of 576 observations. With these deletions, our baseline estimates
using the same controls as in (7.35) are quite similar to those reported in Donohue III and
Levitt (2001). Baseline estimates from Table IV of Donohue III and Levitt (2001) and our
baseline estimates based on the differenced version of (7.35) are given in the first and second
rows of Table 2, respectively.

14 This variable is constructed as a weighted average of abortion rates where weights are determined
by the fraction of the type of crime committed by various age groups. For example, if 60% of violent
crime were committed by 18 year olds and 40% were committed by 19 year olds in state $i$, the abortion
rate for violent crime at time $t$ in state $i$ would be constructed as .6 times the abortion rate in state
$i$ at time $t - 18$ plus .4 times the abortion rate in state $i$ at time $t - 19$. See Donohue III and Levitt
(2001) for further detail and exact construction methods.
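
The weighting used to build the crime-specific abortion-rate measure (the 60%/40% example in the footnote on its construction) can be written out as a short Python sketch. The function name `effective_abortion_rate` and the numerical abortion rates are made up for illustration; the exact construction is in Donohue III and Levitt (2001).

```python
def effective_abortion_rate(age_shares, lagged_rates):
    """Weighted average of lagged abortion rates.

    age_shares:   {age: share of this crime type committed at that age}
    lagged_rates: {age: abortion rate in the state at time t - age}
    """
    if abs(sum(age_shares.values()) - 1.0) > 1e-9:
        raise ValueError("age shares must sum to one")
    return sum(share * lagged_rates[age] for age, share in age_shares.items())

# 60% of violent crime by 18 year olds, 40% by 19 year olds.
shares = {18: 0.6, 19: 0.4}
rates = {18: 200.0, 19: 150.0}     # hypothetical abortion rates at t-18 and t-19
rate = effective_abortion_rate(shares, rates)   # .6 * 200 + .4 * 150
```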

Our main point of departure from Donohue III and Levitt (2001) is that we allow for a much
richer set $z_{it}$ than allowed for in $w_{it}$ in model (7.35). Our $z_{it}$ includes higher-order terms and
interactions of the control variables defined above. In addition, we put initial conditions and
initial differences of $w_{it}$ and $a_{it}$ into our vector of controls $z_{it}$. This addition allows for the
possibility that there may be some feature of a state that is associated both with its growth
rate in abortion and its growth rate in crime. For example, having initially high levels of
abortion could be associated with having high growth rates in abortion and low growth rates
in crime. Failure to control for this factor could then lead to misattributing the effect of this
initial factor, perhaps driven by policy or state-level demographics, to the effect of abortion.
Finally, we allow for more general trends by allowing for an aggregate quadratic trend in $z_{it}$
as well as interactions of this quadratic trend with control variables. This gives us a set of 251
control variables to select among in addition to the 12 time effects that we include in every
model.$^{15}$

Note that interpreting estimates of the effect of abortion from model (7.35) as causal relies
on the belief that there are no higher-order terms of the control variables, no interaction terms,
and no additional excluded variables that are associated with both crime rates and the associated
abortion rate. Thus, controlling for a large set of variables as described above is desirable from
the standpoint of making this belief more plausible. At the same time, naively controlling for
all of these variables lessens our ability to identify the effect of interest and thus tends to make
estimates far less precise. The result of estimating the abortion effect conditioning on the full set
of 251 potential controls described above is given in the third row of Table 2. As expected, all
coefficients are estimated very imprecisely. Of course, very few researchers would consider using
251 controls with only 576 observations due to exactly this issue.

We are faced with a tradeoff between controlling for very few variables, which may leave
us wondering whether we have included sufficient controls for the exogeneity of the treatment,
and controlling for so many variables that we are essentially mechanically unable to learn
about the effect of the treatment. The variable selection methods developed in this paper
offer one resolution to this tension. The assumed sparse structure maintains that there is a
small enough set of variables that one could potentially learn about the treatment but adds
substantial flexibility to the usual case where a researcher considers only a few control variables
by allowing this set to be found by the data from among a large set of controls. Thus, the
approach should complement the usual careful specification analysis by providing a researcher
an efficient, data-driven way to search for a small set of influential confounds from among a
sensibly chosen broad set of potential confounding variables.

15 The exact identities of the 251 potential controls are available upon request. The set consists of
linear and quadratic terms of each continuous variable in $w_{it}$, interactions of every variable in $w_{it}$,
initial levels and initial differences of $w_{it}$ and $a_{it}$, and interactions of these variables with a
quadratic trend.

In the abortion example, we use the post-double-selection estimator defined in Section 2.2
for each of our dependent variables. For violent crime, ten variables are selected in the abortion
equation$^{16}$ and one is selected in the crime equation.$^{17}$ For property crime, eight variables are
selected in the abortion equation$^{18}$ and six are selected in the crime equation.$^{19}$ For murder,
eight variables are selected in the abortion equation$^{20}$ and none are selected in the crime
equation.

16 The selected variables are AFDC generosity squared, beer consumption squared, the initial poverty change,
initial income, initial income squared, the initial change in prisoners per capita squared interacted with the trend,
initial income interacted with the trend, the initial change in the abortion rate, the initial change in the abortion
rate interacted with the trend, and the initial level of the abortion rate.

17 The initial level of the abortion rate interacted with time is selected.

18 The selected variables are income, the initial poverty change, the initial change in prisoners per capita
squared, the initial level of prisoners per capita, initial income, the initial change in the abortion rate, the initial
change in the abortion rate interacted with the trend, and the initial level of the abortion rate.

19 The six variables are the initial level of AFDC generosity, the initial level of income interacted with the
trend and the trend squared, the initial level of income squared interacted with the trend and the trend squared,
and the initial level of the abortion rate interacted with the trend.

20 The selected variables are AFDC generosity, beer consumption squared, the change in beer consumption
squared, the change in beer consumption squared times the trend and the trend squared, initial income times
the trend, the initial change in the abortion rate interacted with the trend, and the initial level of the abortion
rate.

Estimates of the causal effect of abortion on crime obtained by searching for confounding
factors among our set of 251 potential controls are given in the fourth row of Table 2. Each of
these estimates is obtained from the least squares regression of the crime rate on the abortion
rate and the 11, 14, and eight controls selected by the post-double-Lasso procedure for violent
crime, property crime, and murder, respectively. The estimates for the effect of abortion on
violent crime and the effect of abortion on murder are quite imprecise, producing 95% confidence
intervals that encompass large positive and negative values. The estimated effect for
property crime is roughly in line with the previous estimates, though it is no longer significant
at the 5% level but is significant at the 10% level. Note that the post-double-Lasso produces
models that are not of vastly different size than the "intuitive" model (7.35). As a final check,
we also report results that include all of the original variables from (7.35) in the amelioration
set in the fifth row of the table. These results show that the conclusions made from using
only the variable selection procedure do not qualitatively change when the variables used in
the original Donohue III and Levitt (2001) are added to the equation. For a quick benchmark
relative to the simulation examples, we note that the $R^2$'s obtained by regressing the crime rate
on the selected variables are .0395, .1185, and .0044 for violent crime, property crime, and the
murder rate, respectively, and that the $R^2$'s from regressing the abortion rate on the selected
variables are .9447, .9013, and .9144 for violent crime, property crime, and the murder rate,
respectively. These values correspond to regions of the $R^2$ space considered in the simulation
where the double selection procedure substantially outperformed simple Lasso procedures.

It is very interesting that one would draw qualitatively different conclusions from the esti- 
mates obtained using formal variable selection than from the estimates obtained using a small 
set of intuitively selected controls. Looking at the set of selected control variables, we see that 
initial conditions and interactions with trends are selected across all dependent variables. The 
selection of this set of variables suggests that there are initial factors which are associated with 
the change in the abortion rate. We also see that we cannot precisely determine the effect of 
the abortion rate on crime rates once one accounts for initial conditions. Of course, this does 
not mean that the effects of the abortion rate provided in the first two rows of Table 2 are 
not representative of the true causal effects. It does, however, imply that this conclusion is 
strongly predicated on the belief that there are no other unobserved state-level factors that
are correlated with both initial values of the controls and abortion rates, abortion rate changes,
and crime rate changes. Interestingly, a similar conclusion is given in Foote and Goetz (2008) 
based on an intuitive argument. 

We believe that the example in this section illustrates how one may use modern variable 
selection techniques to complement causal analysis in economics. In the abortion example, 
we are able to search among a large set of controls and transformations of variables when 
trying to estimate the effect of abortion on crime. Considering a large set of controls makes 
the underlying assumption of exogeneity of the abortion rate conditional on observables more 
plausible, while the methods we develop allow us to produce an end-model which is of manage- 
able dimension. Interestingly, we see that one would draw quite different conclusions from the 
estimates obtained using formal variable selection. Looking at the variables selected, we can 
also see that this change in interpretation is being driven by the variable selection method's 
selecting different variables, specifically initial values of the abortion rate and controls, than 
are usually considered. Thus, it appears that the usual interpretation hinges on the prior belief 
that initial values should be excluded from the structural equation. 

8. Conclusion 

In this paper, we consider estimation of treatment effects or structural parameters in an 
environment where the treatment is believed to be exogenous conditional on observables. We 
do not impose the conventional assumption that the identities of the relevant conditioning variables
and the functional form with which they enter the model are known. Rather, we assume
that the researcher believes there is a relatively small number of important factors whose iden- 
tities are unknown within a much larger known set of potential variables and transformations. 
This sparsity assumption allows the researcher to estimate the desired treatment effect and 
infer a set of important variables upon which one needs to condition by using modern variable 
selection techniques without ex ante knowledge of which are the important conditioning vari- 
ables. Since naive application of variable selection methods in this context may result in very 
poor properties for inferring the treatment effect of interest, we propose a "double-selection" 
estimator of the treatment effect, provide a formal demonstration of its properties for estimat- 
ing the treatment effect, and provide its approximate distribution under technical regularity 
conditions and the assumed sparsity in the model. 

In addition to the theoretical development, we illustrate the potential usefulness of our 
proposal through a number of simulation studies and an empirical example. In Monte Carlo 
simulations, our procedure outperforms simple variable selection strategies for estimating the 
treatment effect across the designs considered and does relatively well compared to an infeasible 
estimator that uses the identities of the relevant conditioning variables. We then apply our 
estimator to attempt to estimate the causal impact of abortion on crime following Donohue III 
and Levitt (2001). We find that our procedure selects a small number of conditioning variables. 
After conditioning on these selected variables, one would draw qualitatively different inference 
about the effect of abortion on crime than would be drawn if one assumed that the correct 
set of conditioning variables was known and the same as those variables used in Donohue III 
and Levitt (2001). Taken together, the empirical and simulation examples demonstrate that 
the proposed method may provide a useful complement to other sorts of specification analysis 
done in applied research. 



Appendix A. Iterated Estimation of Penalty Loadings 

In the case of Lasso under heteroscedasticity, we must specify the penalty loadings (2.13).
Here we state algorithms for estimating these loadings.

Let $I_0$ be an initial set of regressors with a bounded number of elements, including, for
example, the intercept. Let $\tilde\beta(I_0)$ be the least squares estimator of the coefficients on the covariates
associated with $I_0$, and define $\hat{l}_{jI_0} := \sqrt{\mathbb{E}_n[x_{ij}^2 (y_i - x_i'\tilde\beta(I_0))^2]}$.

An algorithm for estimating the penalty loadings using Post-Lasso is as follows: 

Algorithm 1 (Estimation of Lasso loadings using Post-Lasso iterations). Set $\hat{l}_{j,0} := \hat{l}_{jI_0}$,
$j = 1, \ldots, p$. Set $k = 0$, and specify a small constant $\nu \geq 0$ as a tolerance level and a constant
$K > 1$ as an upper bound on the number of iterations. (1) Compute the Post-Lasso estimator $\tilde\beta$
based on the loadings $\hat{l}_{j,k}$. (2) For $\hat{s} = \|\tilde\beta\|_0 = |\hat{T}|$, set
$\hat{l}_{j,k+1} := \sqrt{\mathbb{E}_n[x_{ij}^2 (y_i - x_i'\tilde\beta)^2]} \sqrt{n/(n - \hat{s})}$.
(3) If $\max_{1 \leq j \leq p} |\hat{l}_{j,k} - \hat{l}_{j,k+1}| \leq \nu$ or $k > K$, set the loadings to $\hat{l}_{j,k+1}$, $j = 1, \ldots, p$, and stop;
otherwise, set $k \leftarrow k + 1$ and go to (1).
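
The iteration in Algorithm 1 can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names are hypothetical, and `post_lasso` is a user-supplied callable (an assumed interface, not part of any library) that takes the data and the current loadings and returns a length-$p$ coefficient vector with zeros for unselected regressors.

```python
import numpy as np

def initial_loadings(X, y, I0):
    """l_{j,0} = sqrt(E_n[x_ij^2 (y_i - x_i' beta(I0))^2]) with OLS on columns I0."""
    beta, *_ = np.linalg.lstsq(X[:, I0], y, rcond=None)
    resid = y - X[:, I0] @ beta
    return np.sqrt(np.mean(X ** 2 * resid[:, None] ** 2, axis=0))

def update_loadings(X, resid, s_hat):
    """Step (2): sqrt(E_n[x_ij^2 e_i^2]) * sqrt(n / (n - s_hat))."""
    n = X.shape[0]
    base = np.sqrt(np.mean(X ** 2 * resid[:, None] ** 2, axis=0))
    return base * np.sqrt(n / (n - s_hat))

def iterate_loadings(X, y, post_lasso, I0, nu=1e-4, K=15):
    """Loop of Algorithm 1; `post_lasso(X, y, loadings)` is an assumed interface."""
    l_k = initial_loadings(X, y, I0)
    k = 0
    while True:
        beta = post_lasso(X, y, l_k)                         # step (1)
        s_hat = int(np.count_nonzero(beta))                  # size of selected model
        l_next = update_loadings(X, y - X @ beta, s_hat)     # step (2)
        if np.max(np.abs(l_k - l_next)) <= nu or k > K:      # step (3)
            return l_next
        l_k, k = l_next, k + 1
```

The degrees-of-freedom factor $\sqrt{n/(n - \hat{s})}$ inflates the loadings slightly to account for the residuals coming from a fitted model with $\hat{s}$ regressors.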

A similar algorithm can be defined for use with Post-Square-root Lasso instead of
Post-Lasso.$^{21}$

Algorithm 2 (Estimation of Square-root Lasso loadings using Post-Square-root Lasso iterations).
Set $k = 0$, and specify a small constant $\nu \geq 0$ as a tolerance level and a constant $K > 1$
as an upper bound on the number of iterations. (1) Compute the Post-Square-root Lasso
estimator $\tilde\beta$ based on the loadings $\hat{l}_{j,k}$. (2) Set
$\hat{l}_{j,k+1} := \sqrt{\mathbb{E}_n[x_{ij}^2 (y_i - x_i'\tilde\beta)^2]} / \sqrt{\mathbb{E}_n[(y_i - x_i'\tilde\beta)^2]}$.
(3) If $\max_{1 \leq j \leq p} |\hat{l}_{j,k} - \hat{l}_{j,k+1}| \leq \nu$ or $k > K$, set the loadings to $\hat{l}_{j,k+1}$, $j = 1, \ldots, p$, and stop;
otherwise set $k \leftarrow k + 1$ and go to (1).

Appendix B. Proof of Theorem Q] 

The proof proceeds under a given sequence of probability measures $\{P_n\}$, as $n \to \infty$.

Let $Y = [y_1, \ldots, y_n]'$, $X = [x_1, \ldots, x_n]'$, $D = [d_1, \ldots, d_n]'$, $V = [v_1, \ldots, v_n]'$, $\zeta = [\zeta_1, \ldots, \zeta_n]'$, $m = [m_1, \ldots, m_n]'$, $R_m = [r_{m1}, \ldots, r_{mn}]'$, $g = [g_1, \ldots, g_n]'$, $R_g = [r_{g1}, \ldots, r_{gn}]'$, and so on. For $A \subset \{1, \ldots, p\}$, let $X[A] = \{X_j, j \in A\}$, where $\{X_j, j = 1, \ldots, p\}$ are the columns of $X$. Let

$$ \mathcal{P}_A = X[A](X[A]'X[A])^{-}X[A]' $$

be the projection operator sending vectors in $\mathbb{R}^n$ onto span$[X[A]]$, and let $\mathcal{M}_A = I_n - \mathcal{P}_A$ be the projection onto the subspace that is orthogonal to span$[X[A]]$. For a vector $Z \in \mathbb{R}^n$, let

$$ \tilde\beta_Z(A) := \arg\min_{b \in \mathbb{R}^p} \|Z - Xb\|^2 : \ b_j = 0, \ \forall j \notin A, $$

be the coefficient of linear projection of $Z$ onto span$[X[A]]$. If $A = \emptyset$, interpret $\mathcal{P}_A = 0_n$ and $\tilde\beta_Z = 0_p$.
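As a concrete illustration of this notation, the sketch below builds the projection onto span$[X[A]]$ with a Moore-Penrose pseudo-inverse standing in for the generalized inverse, together with the restricted least-squares coefficient. The function names `projection` and `restricted_ls` are hypothetical and not part of the paper.

```python
import numpy as np

def projection(X, A):
    """P_A = X[A] (X[A]'X[A])^- X[A]'; interpreted as the zero matrix when A is empty."""
    n = X.shape[0]
    if len(A) == 0:
        return np.zeros((n, n))
    XA = X[:, list(A)]
    return XA @ np.linalg.pinv(XA.T @ XA) @ XA.T

def restricted_ls(X, Z, A):
    """Coefficient of the linear projection of Z on span[X[A]]; zero outside A."""
    b = np.zeros(X.shape[1])
    if len(A):
        b[list(A)] = np.linalg.lstsq(X[:, list(A)], Z, rcond=None)[0]
    return b
```

With these definitions, `X @ restricted_ls(X, Z, A)` agrees with `projection(X, A) @ Z`, and `np.eye(n) - projection(X, A)` annihilates every column of `X[:, A]`.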

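Finally, denote $\phi_{\min}(m) = \phi_{\min}(m)[E_n[x_i x_i']]$ and $\phi_{\max}(m) = \phi_{\max}(m)[E_n[x_i x_i']]$.

Step 1. (Main) Write $\check\alpha = [D'\mathcal{M}_{\hat I}D/n]^{-1}[D'\mathcal{M}_{\hat I}Y/n]$ so that

$$ \sqrt{n}(\check\alpha - \alpha_0) = [D'\mathcal{M}_{\hat I}D/n]^{-1}[D'\mathcal{M}_{\hat I}(g + \zeta)/\sqrt{n}] =: ii^{-1} \cdot i. $$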
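The partialled-out form of the estimator in Step 1 can be checked numerically. The sketch below is an illustration under stated assumptions rather than the paper's code: it assumes a non-empty selected set and uses hypothetical function names. It computes $[D'\mathcal{M}_{\hat I}D]^{-1}[D'\mathcal{M}_{\hat I}Y]$ and, for comparison, the coefficient on $D$ from least squares of $Y$ on $D$ and the selected columns; the two coincide by the Frisch-Waugh-Lovell theorem.

```python
import numpy as np

def alpha_partialled_out(Y, D, X, I_hat):
    """[D'M D]^{-1} [D'M Y] with M the annihilator of the selected columns X[I_hat]."""
    XI = X[:, list(I_hat)]                                   # assumed non-empty
    M = np.eye(len(Y)) - XI @ np.linalg.pinv(XI.T @ XI) @ XI.T
    return float(D @ M @ Y) / float(D @ M @ D)

def alpha_ols(Y, D, X, I_hat):
    """Coefficient on D from least squares of Y on [D, X[I_hat]]."""
    W = np.column_stack([D, X[:, list(I_hat)]])
    return float(np.linalg.lstsq(W, Y, rcond=None)[0][0])
```

Forming the $n \times n$ matrix `M` explicitly keeps the code close to the notation; for large $n$ one would residualize `D` and `Y` on `XI` instead.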



21 The algorithms can also be modified in the obvious manner for Lasso or Square-root Lasso.

By Steps 2 and 3,

$$ ii = V'V/n + o_P(1) \quad \text{and} \quad i = V'\zeta/\sqrt{n} + o_P(1). $$

Next note that $V'V/n = E[V'V/n] + o_P(1)$ by Chebyshev, and because $E[V'V/n]$ is bounded away from zero and from above uniformly in $n$ by Condition SM, we have $ii^{-1} = E[V'V/n]^{-1} + o_P(1)$.

By Condition SM, $\sigma_n^2 = E[v_i^2]^{-1}E[\zeta_i^2 v_i^2]E[v_i^2]^{-1}$ is bounded away from zero and from above, uniformly in $n$. Hence

$$ Z_n = \sigma_n^{-1}\sqrt{n}(\check\alpha - \alpha_0) = n^{-1/2}\sum_{i=1}^n z_{i,n} + o_P(1), $$

where $z_{i,n} := \sigma_n^{-1} v_i \zeta_i$ are i.n.i.d. with mean zero. For $\delta > 0$ such that $4 + 2\delta \le q$,

$$ E|z_{i,n}|^{2+\delta} \lesssim E\left[|v_i|^{2+\delta}|\zeta_i|^{2+\delta}\right] \lesssim \sqrt{E|v_i|^{4+2\delta}}\sqrt{E|\zeta_i|^{4+2\delta}} \lesssim 1, $$

by Condition SM. This verifies the Lyapunov condition, and thus application of the Lyapunov CLT for i.n.i.d. triangular arrays implies that

$$ Z_n \rightsquigarrow N(0, 1). $$

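Step 2. (Behavior of i.) Decompose, using $D = m + V$,

$$ i = V'\zeta/\sqrt{n} + \underbrace{m'\mathcal{M}_{\hat I}g/\sqrt{n}}_{=: i_a} + \underbrace{m'\mathcal{M}_{\hat I}\zeta/\sqrt{n}}_{=: i_b} + \underbrace{V'\mathcal{M}_{\hat I}g/\sqrt{n}}_{=: i_c} - \underbrace{V'\mathcal{P}_{\hat I}\zeta/\sqrt{n}}_{=: i_d}. $$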
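For completeness, the algebra behind this decomposition can be written out as follows. It uses only $D = m + V$ and the fact that the orthogonal projection onto the complement of the selected columns equals the identity minus the projection onto their span; the symbols $\mathcal{M}_{\hat I}$ and $\mathcal{P}_{\hat I}$ denote these two projections for the selected set $\hat I$.

```latex
\begin{align*}
i &= D'\mathcal{M}_{\hat I}(g + \zeta)/\sqrt{n}
   = (m + V)'\mathcal{M}_{\hat I}(g + \zeta)/\sqrt{n} \\
  &= m'\mathcal{M}_{\hat I}g/\sqrt{n} + m'\mathcal{M}_{\hat I}\zeta/\sqrt{n}
     + V'\mathcal{M}_{\hat I}g/\sqrt{n} + V'\mathcal{M}_{\hat I}\zeta/\sqrt{n}, \\
V'\mathcal{M}_{\hat I}\zeta
  &= V'(I_n - \mathcal{P}_{\hat I})\zeta = V'\zeta - V'\mathcal{P}_{\hat I}\zeta .
\end{align*}
```

Substituting the last identity into the second line gives the leading term $V'\zeta/\sqrt{n}$ plus the four remainder terms $i_a$, $i_b$, $i_c$ and $-i_d$.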

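First, by Steps 5 and 6 below we have

$$ |i_a| = |m'\mathcal{M}_{\hat I}g/\sqrt{n}| \le \sqrt{n}\,\|\mathcal{M}_{\hat I}g/\sqrt{n}\|\,\|\mathcal{M}_{\hat I}m/\sqrt{n}\| \lesssim_P \sqrt{[s\log(p \vee n)]^2/n} = o(1), $$

where the last bound follows from the assumed growth condition $s^2\log^2(p \vee n) = o(n)$.

Second, using that $m = X\beta_{m0} + R_m$ and $m'\mathcal{M}_{\hat I}\zeta = R_m'\zeta - (\tilde\beta_m(\hat I) - \beta_{m0})'X'\zeta$, conclude

$$ |i_b| \le |R_m'\zeta/\sqrt{n}| + |(\tilde\beta_m(\hat I) - \beta_{m0})'X'\zeta/\sqrt{n}| \lesssim_P \sqrt{[s\log(p \vee n)]^2/n} = o_P(1). $$

This follows since

$$ |R_m'\zeta/\sqrt{n}| \lesssim_P \sqrt{R_m'R_m/n} \lesssim_P \sqrt{s/n}, $$

holding by Chebyshev inequality and Conditions SM and ASTE(iii), and

$$ |(\tilde\beta_m(\hat I) - \beta_{m0})'X'\zeta/\sqrt{n}| \le \|\tilde\beta_m(\hat I) - \beta_{m0}\|_1\,\|X'\zeta/\sqrt{n}\|_\infty \lesssim_P \sqrt{[s^2\log(p \vee n)]/n}\,\sqrt{\log(p \vee n)}. $$

The latter bound follows by (a)

$$ \|\tilde\beta_m(\hat I) - \beta_{m0}\|_1 \le \sqrt{\hat s + s}\,\|\tilde\beta_m(\hat I) - \beta_{m0}\| \lesssim_P \sqrt{[s^2\log(p \vee n)]/n} $$

holding by Step 5 and by $\hat s \lesssim_P s$ implied by Lemma 1, and (b) by

$$ \|X'\zeta/\sqrt{n}\|_\infty \lesssim_P \sqrt{\log(p \vee n)} $$

holding by Step 4 under Condition SM.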

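Third, using similar reasoning, the decomposition $g = X\beta_{g0} + R_g$, and Steps 4 and 6, conclude

$$ |i_c| \le |R_g'V/\sqrt{n}| + |(\tilde\beta_g(\hat I) - \beta_{g0})'X'V/\sqrt{n}| \lesssim_P \sqrt{[s\log(p \vee n)]^2/n} = o_P(1). $$

Fourth, we have

$$ |i_d| \le |\tilde\beta_V(\hat I)'X'\zeta/\sqrt{n}| \le \|\tilde\beta_V(\hat I)\|_1\,\|X'\zeta/\sqrt{n}\|_\infty \lesssim_P \sqrt{[s\log(p \vee n)]^2/n} = o_P(1), $$

since by Step 4 below $\|X'\zeta/\sqrt{n}\|_\infty \lesssim_P \sqrt{\log(p \vee n)}$, and

$$ \|\tilde\beta_V(\hat I)\|_1 \le \sqrt{\hat s}\,\|\tilde\beta_V(\hat I)\| \le \sqrt{\hat s}\,\|(X[\hat I]'X[\hat I]/n)^{-1}X[\hat I]'V/n\| \le \sqrt{\hat s}\,\phi_{\min}^{-1}(\hat s)\,\sqrt{\hat s}\,\|X'V/\sqrt{n}\|_\infty/\sqrt{n} \lesssim_P s\sqrt{[\log(p \vee n)]/n}. $$

The latter bound follows from $\hat s \lesssim_P s$, holding by Lemma 1, so that $\phi_{\min}^{-1}(\hat s) \lesssim_P 1$ by Condition SE, and from $\|X'V/\sqrt{n}\|_\infty \lesssim_P \sqrt{\log(p \vee n)}$ holding by Step 4.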

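Step 3. (Behavior of ii.) Decompose

$$ ii = (m + V)'\mathcal{M}_{\hat I}(m + V)/n = V'V/n + \underbrace{m'\mathcal{M}_{\hat I}m/n}_{=: ii_a} + \underbrace{2m'\mathcal{M}_{\hat I}V/n}_{=: ii_b} - \underbrace{V'\mathcal{P}_{\hat I}V/n}_{=: ii_c}. $$

Then $|ii_a| \lesssim_P [s\log(p \vee n)]/n = o_P(1)$ by Step 5, $|ii_b| \lesssim_P [s\log(p \vee n)]/n = o_P(1)$ by reasoning similar to deriving the bound for $|i_b|$, and $|ii_c| \lesssim_P [s\log(p \vee n)]/n = o_P(1)$ by reasoning similar to deriving the bound for $|i_d|$.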

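Step 4. (Auxiliary: Bounds on $\|X'\zeta/\sqrt{n}\|_\infty$ and $\|X'V/\sqrt{n}\|_\infty$.) Here we show that

$$ \text{(a)} \ \|X'\zeta/\sqrt{n}\|_\infty \lesssim_P \sqrt{\log(p \vee n)} \quad \text{and} \quad \text{(b)} \ \|X'V/\sqrt{n}\|_\infty \lesssim_P \sqrt{\log(p \vee n)}. $$

To show (a), we use Lemma 4 stated in Appendix G on the tail bound for self-normalized deviations to deduce the bound. Indeed, for some $\ell_n \to \infty$ growing so slowly that $1/\gamma = \ell_n \lesssim \log n$, we have with probability $1 - o(1)$

$$ \max_{1 \le j \le p}\left|\frac{n^{-1/2}\sum_{i=1}^n x_{ij}\zeta_i}{\sqrt{E_n[x_{ij}^2\zeta_i^2]}}\right| \le \Phi^{-1}\left(1 - \frac{1}{2\ell_n p}\right) \le \sqrt{2\log(2\ell_n p)} \lesssim \sqrt{\log(p \vee n)}. \quad \text{(B.37)} $$

By Lemma 4 the first inequality in (B.37) holds, provided that for all $n$ sufficiently large the following holds,

$$ \Phi^{-1}\left(1 - \frac{1}{2\ell_n p}\right) \le \frac{n^{1/6}}{\ell_n}\min_{1 \le j \le p} M_j^2, \qquad M_j := \frac{E[x_{ij}^2\zeta_i^2]^{1/2}}{E[|x_{ij}|^3|\zeta_i|^3]^{1/3}}. $$

Since we can choose $\ell_n$ to grow as slowly as needed, a sufficient condition for this is given by the conditions

$$ \log p = o(n^{1/3}) \quad \text{and} \quad \min_{1 \le j \le p} M_j \gtrsim 1, $$

which both hold by Condition SM. Finally,

$$ \max_{1 \le j \le p} E_n[x_{ij}^2\zeta_i^2] \lesssim_P 1, \quad \text{(B.38)} $$

by Condition SM. Therefore (a) follows from the bounds (B.37) and (B.38). Claim (b) follows similarly.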

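Step 5. (Auxiliary: Bound on $\|\mathcal{M}_{\hat I}m\|$ and related quantities.) This step shows that

$$ \text{(a)} \ \|\mathcal{M}_{\hat I}m/\sqrt{n}\| \lesssim_P \sqrt{[s\log(p \vee n)]/n} \quad \text{and} \quad \text{(b)} \ \|\tilde\beta_m(\hat I) - \beta_{m0}\| \lesssim_P \sqrt{[s\log(p \vee n)]/n}. $$

Observe that

$$ \sqrt{[s\log(p \vee n)]/n} \underset{(1)}{\gtrsim_P} \|\mathcal{M}_{\hat I_1}m/\sqrt{n}\| \underset{(2)}{\gtrsim_P} \|\mathcal{M}_{\hat I}m/\sqrt{n}\| $$

where inequality (1) holds since by Lemma 1 $\|\mathcal{M}_{\hat I_1}m/\sqrt{n}\| \le \|(X\tilde\beta_D(\hat I_1) - m)/\sqrt{n}\| \lesssim_P \sqrt{[s\log(p \vee n)]/n}$, and (2) holds by $\hat I_1 \subseteq \hat I$ by construction. This shows claim (a). To show claim (b) note that

$$ \|\mathcal{M}_{\hat I}m/\sqrt{n}\| \underset{(3)}{\ge} \left|\,\|X(\tilde\beta_m(\hat I) - \beta_{m0})/\sqrt{n}\| - \|R_m/\sqrt{n}\|\,\right| $$

where (3) holds by the triangle inequality. Since $\|R_m/\sqrt{n}\| \lesssim_P \sqrt{s/n}$ by Chebyshev and Condition ASTE(iii), conclude that

$$ \sqrt{[s\log(p \vee n)]/n} \gtrsim_P \|X(\tilde\beta_m(\hat I) - \beta_{m0})/\sqrt{n}\| \ge \sqrt{\phi_{\min}(\hat s + s)}\,\|\tilde\beta_m(\hat I) - \beta_{m0}\| \gtrsim_P \|\tilde\beta_m(\hat I) - \beta_{m0}\|, $$

since $\hat s \lesssim_P s$ by Lemma 1 so that $1/\phi_{\min}(\hat s + s) \lesssim_P 1$ by Condition SE. This shows claim (b).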
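Step 6. (Auxiliary: Bound on $\|\mathcal{M}_{\hat I}g\|$ and related quantities.) This step shows that

$$ \text{(a)} \ \|\mathcal{M}_{\hat I}g/\sqrt{n}\| \lesssim_P \sqrt{[s\log(p \vee n)]/n} \quad \text{and} \quad \text{(b)} \ \|\tilde\beta_g(\hat I) - \beta_{g0}\| \lesssim_P \sqrt{[s\log(p \vee n)]/n}. $$

Observe that

$$ \sqrt{[s\log(p \vee n)]/n} \underset{(1)}{\gtrsim_P} \|\mathcal{M}_{\hat I_2}(\alpha_0 m + g)/\sqrt{n}\| \underset{(2)}{\gtrsim_P} \|\mathcal{M}_{\hat I}(\alpha_0 m + g)/\sqrt{n}\| \underset{(3)}{\gtrsim_P} \left|\,\|\mathcal{M}_{\hat I}g/\sqrt{n}\| - \|\mathcal{M}_{\hat I}\alpha_0 m/\sqrt{n}\|\,\right| $$

where inequality (1) holds since by Lemma 1 $\|\mathcal{M}_{\hat I_2}(\alpha_0 m + g)/\sqrt{n}\| \le \|(X\tilde\beta_Y(\hat I_2) - \alpha_0 m - g)/\sqrt{n}\| \lesssim_P \sqrt{[s\log(p \vee n)]/n}$, (2) holds by $\hat I_2 \subseteq \hat I$, and (3) by the triangle inequality. Since $\|\alpha_0\|$ is bounded uniformly in $n$ by assumption, by Step 5, $\|\mathcal{M}_{\hat I}\alpha_0 m/\sqrt{n}\| \lesssim_P \sqrt{[s\log(p \vee n)]/n}$. Hence claim (a) follows by the triangle inequality:

$$ \sqrt{[s\log(p \vee n)]/n} \gtrsim_P \|\mathcal{M}_{\hat I}g/\sqrt{n}\|. $$

To show claim (b) we note that

$$ \|\mathcal{M}_{\hat I}g/\sqrt{n}\| \ge \left|\,\|X(\tilde\beta_g(\hat I) - \beta_{g0})/\sqrt{n}\| - \|R_g/\sqrt{n}\|\,\right| $$

where $\|R_g/\sqrt{n}\| \lesssim_P \sqrt{s/n}$ by Condition ASTE(iii). Then conclude similarly to Step 5 that

$$ \sqrt{[s\log(p \vee n)]/n} \gtrsim_P \|X(\tilde\beta_g(\hat I) - \beta_{g0})/\sqrt{n}\| \ge \sqrt{\phi_{\min}(\hat s + s)}\,\|\tilde\beta_g(\hat I) - \beta_{g0}\| \gtrsim_P \|\tilde\beta_g(\hat I) - \beta_{g0}\|. $$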

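Step 7. (Variance Estimation.) Since $\hat s \lesssim_P s = o(n)$, $(n - \hat s - 1)/n = 1 + o_P(1)$, and since $E[v_i^2\zeta_i^2]$ and $E[v_i^2]$ are bounded away from zero and from above uniformly in $n$ by Condition SM, it suffices to show that

$$ E_n[\hat v_i^2\hat\zeta_i^2] - E[v_i^2\zeta_i^2] \to_P 0, \qquad E_n[\hat v_i^2] - E[v_i^2] \to_P 0. $$

The second relation was shown in Step 3, so it remains to show the first relation.

Let $\tilde v_i = v_i + r_{mi}$ and $\tilde\zeta_i = \zeta_i + r_{gi}$. Recall that by Condition ASTE(v) we have $E[\tilde v_i^2\tilde\zeta_i^2] - E[v_i^2\zeta_i^2] \to 0$, and $E_n[\tilde v_i^2\tilde\zeta_i^2] - E[\tilde v_i^2\tilde\zeta_i^2] \to_P 0$ by the von Bahr-Esseen inequality in von Bahr and Esseen (1965), since $E[|\tilde v_i\tilde\zeta_i|^{2+\delta}] \le (E[|\tilde v_i|^{4+2\delta}]E[|\tilde\zeta_i|^{4+2\delta}])^{1/2}$ is uniformly bounded for $4 + 2\delta \le q$. Thus it suffices to show that $E_n[\hat v_i^2\hat\zeta_i^2] - E_n[\tilde v_i^2\tilde\zeta_i^2] \to_P 0$.

By the triangle inequality

$$ |E_n[\hat v_i^2\hat\zeta_i^2 - \tilde v_i^2\tilde\zeta_i^2]| \le \underbrace{|E_n[(\hat v_i^2 - \tilde v_i^2)\tilde\zeta_i^2]|}_{=: iv} + \underbrace{|E_n[\hat v_i^2(\hat\zeta_i^2 - \tilde\zeta_i^2)]|}_{=: iii}. $$

Then, expanding $\hat\zeta_i^2 - \tilde\zeta_i^2$ we have

$$ iii \le 2E_n[\{d_i(\alpha_0 - \check\alpha)\}^2\hat v_i^2] + 2E_n[\{x_i'(\check\beta - \beta_{g0})\}^2\hat v_i^2] + |2E_n[\tilde\zeta_i d_i(\alpha_0 - \check\alpha)\hat v_i^2]| + |2E_n[\tilde\zeta_i x_i'(\check\beta - \beta_{g0})\hat v_i^2]| =: iii_a + iii_b + iii_c + iii_d = o_P(1) $$

where the last bound follows by the relations derived below.

First, we note

$$ iii_a \le 2\max_{i \le n} d_i^2\,|\alpha_0 - \check\alpha|^2\,E_n[\hat v_i^2] \lesssim_P n^{(2/q) - 1} = o(1) \quad \text{(B.39)} $$
$$ iii_c \le 2\max_{i \le n}\{|\tilde\zeta_i||d_i|\}\,E_n[\hat v_i^2]\,|\alpha_0 - \check\alpha| \lesssim_P n^{(2/q) - (1/2)} = o(1) \quad \text{(B.40)} $$

which holds by the following argument. Condition SM assumes that $E[|d_i|^q]$ is bounded, which in turn implies that $E[\max_{i \le n} d_i^2] \lesssim n^{2/q}$. Similarly Condition ASTE implies that $E[\max_{i \le n}\tilde\zeta_i^2] \lesssim n^{2/q}$ and $E[\max_{i \le n}\tilde v_i^2] \lesssim n^{2/q}$. Thus by Markov inequality

$$ \max_{i \le n}\ |d_i| + |\tilde\zeta_i| + |\tilde v_i| \lesssim_P n^{1/q}. \quad \text{(B.41)} $$

Moreover, $E_n[\hat v_i^2] \lesssim_P 1$ and $|\check\alpha - \alpha_0| \lesssim_P n^{-1/2}$ by the previous steps. These bounds and $q > 4$ imposed in Condition SM imply (B.39)-(B.40).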



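Next we bound,

$$ iii_d \le 2\max_{i \le n}|\tilde\zeta_i|\,\max_{i \le n}|x_i'(\check\beta - \beta_{g0})|\,E_n[\hat v_i^2] \lesssim_P n^{1/q}\max_{i \le n}\|x_i\|_\infty\,\sqrt{s}\,\sqrt{\frac{s\log(p \vee n)}{n}} = o_P(1), \quad \text{(B.42)} $$

using (B.41) and that for $\hat T_g = \text{support}(\beta_{g0}) \cup \hat I$, we have

$$ \max_{i \le n}\{x_i'(\check\beta - \beta_{g0})\}^2 \le \max_{i \le n}\|x_{i\hat T_g}\|^2\,\|\check\beta - \beta_{g0}\|^2, $$

where

$$ \max_{i \le n}\|x_{i\hat T_g}\|^2 \le |\hat T_g|\max_{i \le n}\|x_i\|_\infty^2 \lesssim_P s\max_{i \le n}\|x_i\|_\infty^2 $$

by the sparsity assumption in ASTE and the sparsity bound in Lemma 1, and since $\check\beta = (X[\hat I]'X[\hat I])^{-}X[\hat I]'(\zeta + g - (\check\alpha - \alpha_0)D)$ we have

$$ \|\check\beta - \beta_{g0}\| \le \|\tilde\beta_g(\hat I) - \beta_{g0}\| + \|\tilde\beta_\zeta(\hat I)\| + |\check\alpha - \alpha_0| \cdot \|\tilde\beta_D(\hat I)\| \lesssim_P \sqrt{s\log(p \vee n)/n} $$

by Step 6(b), by

$$ \|\tilde\beta_\zeta(\hat I)\| \le \sqrt{\hat s}\,\phi_{\min}^{-1}(\hat s)\,\|X'\zeta/n\|_\infty \lesssim_P \sqrt{s\log(p \vee n)/n} $$

holding by Condition SE and by $\hat s \lesssim_P s$ from Lemma 1 and by Step 4, $|\check\alpha - \alpha_0| \lesssim_P 1/\sqrt{n}$ by Step 1, and

$$ \|\tilde\beta_D(\hat I)\| \le \phi_{\min}^{-1}(\hat s)\,\sqrt{\hat s}\,\max_{j \le p}|E_n[x_{ij}d_i]| \le \phi_{\min}^{-1}(\hat s)\,\sqrt{\hat s}\,\max_{j \le p}\sqrt{E_n[x_{ij}^2 d_i^2]} \lesssim_P \sqrt{s} $$

by Condition SE, $\hat s \lesssim_P s$ by the sparsity bound in Lemma 1, and Condition SM.

The final conclusion in (B.42) then follows by Condition ASTE(iv) and (v).

Next, using the relations above and Condition ASTE(iv) and (v), we also conclude that

$$ iii_b \le 2\max_{i \le n}\{x_i'(\check\beta - \beta_{g0})\}^2\,E_n[\hat v_i^2] \lesssim_P \max_{i \le n}\|x_i\|_\infty^2\,\frac{s}{\sqrt{n}}\,\frac{s\log(p \vee n)}{\sqrt{n}} = o_P(1). \quad \text{(B.43)} $$

Finally, the argument for $iv = o_P(1)$ follows similarly to the argument for $iii = o_P(1)$ and the result follows. □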
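The plug-in quantity studied in Step 7 can be sketched as follows. This is an illustration under assumptions rather than the paper's definition: it takes $\hat v_i$ to be the residual from least squares of $d_i$ on the selected controls, $\hat\zeta_i$ to be the residual from least squares of $y_i$ on $d_i$ and the selected controls rescaled by $\sqrt{n/(n - \hat s - 1)}$, and the variance estimate to be $E_n[\hat v_i^2\hat\zeta_i^2]/(E_n[\hat v_i^2])^2$. The function name `sigma2_hat` is hypothetical.

```python
import numpy as np

def sigma2_hat(Y, D, X, I_hat):
    """Assumed plug-in: E_n[v^2 zeta^2] / (E_n[v^2])^2 from post-selection residuals."""
    n = len(Y)
    XI = X[:, list(I_hat)]
    s_hat = XI.shape[1]
    # Residual of the treatment on the selected controls.
    v_hat = D - XI @ np.linalg.lstsq(XI, D, rcond=None)[0]
    # Residual of the outcome on treatment plus selected controls, with the
    # degrees-of-freedom factor sqrt(n / (n - s_hat - 1)).
    W = np.column_stack([D, XI])
    zeta_hat = Y - W @ np.linalg.lstsq(W, Y, rcond=None)[0]
    zeta_hat = zeta_hat * np.sqrt(n / (n - s_hat - 1))
    ev2 = np.mean(v_hat ** 2)
    return float(np.mean(v_hat ** 2 * zeta_hat ** 2) / ev2 ** 2)
```

In a simulated design with independent standard normal $v_i$ and $\zeta_i$ the target is $\sigma_n^2 = 1$, which gives a simple sanity check on the output.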

Appendix C. Proof of Corollary 1 

Let $\mathbf{P}_n$ be a collection of probability measures $P$ for which conditions ASTE(P), SM(P), SE(P), and R(P) hold for the given $n$. Consider any sequence $\{P_n\}$, with index $n \in \{n_0, n_0 + 1, \ldots\}$, with $P_n \in \mathbf{P}_n$ for each $n \in \{n_0, n_0 + 1, \ldots\}$. By Theorem 1 we have that, for $c = \Phi^{-1}(1 - \gamma/2)$, $\lim_{n \to \infty} P_n(\alpha_0 \in [\check\alpha \pm c\hat\sigma_n/\sqrt{n}]) = \Phi(c) - \Phi(-c) = 1 - \gamma$. This means that for every further subsequence $\{P_{n_k}\}$ with $P_{n_k} \in \mathbf{P}_{n_k}$ for each $k \in \{1, 2, \ldots\}$

$$ \lim_{k \to \infty} P_{n_k}(\alpha_0 \in [\check\alpha \pm c\hat\sigma_{n_k}/\sqrt{n_k}]) = 1 - \gamma. \quad \text{(C.44)} $$

Suppose that the claim of the corollary does not hold, i.e.

$$ \limsup_{n \to \infty}\ \sup_{P \in \mathbf{P}_n}\left|P(\alpha_0 \in [\check\alpha \pm c\hat\sigma_n/\sqrt{n}]) - (1 - \gamma)\right| > 0. $$

Hence there is a subsequence $\{P_{n_k}\}$ with $P_{n_k} \in \mathbf{P}_{n_k}$ for each $k \in \{1, 2, \ldots\}$ such that:

$$ \lim_{k \to \infty} P_{n_k}(\alpha_0 \in [\check\alpha \pm c\hat\sigma_{n_k}/\sqrt{n_k}]) \ne 1 - \gamma. $$

This gives a contradiction to (C.44). The claim (i) follows. Claim (ii) follows from claim (i), since $\mathbf{P} \subseteq \mathbf{P}_n$ for all $n \ge n_0$. □
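The interval appearing in this proof is straightforward to compute. The sketch below evaluates $[\check\alpha \pm c\,\hat\sigma_n/\sqrt{n}]$ with $c = \Phi^{-1}(1 - \gamma/2)$ using the standard-library normal quantile function; the function name and argument names are illustrative, not taken from the paper.

```python
import math
from statistics import NormalDist

def confidence_interval(alpha_check, sigma_hat, n, gamma=0.05):
    """Two-sided interval alpha_check +/- c * sigma_hat / sqrt(n), c = Phi^{-1}(1 - gamma/2)."""
    c = NormalDist().inv_cdf(1 - gamma / 2)
    half_width = c * sigma_hat / math.sqrt(n)
    return alpha_check - half_width, alpha_check + half_width
```

By symmetry of the standard normal distribution, $\Phi(c) - \Phi(-c) = 1 - \gamma$ for this choice of $c$, which is the limiting coverage used in the argument above.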

Appendix D. Proof of Theorem 2

We use the same notation as in Theorem 1. Using that notation the approximation bounds stated in Condition HLMS are equivalent to $\|\mathcal{M}_{\hat I}g\| \le \delta_n n^{1/4}$ and $\|\mathcal{M}_{\hat I}m\| \le \delta_n n^{1/4}$.

Step 1. It follows the same reasoning as Step 1 in the proof of Theorem 1.

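Step 2. (Behavior of i.) Decompose, using $D = m + V$,

$$ i = V'\zeta/\sqrt{n} + \underbrace{m'\mathcal{M}_{\hat I}g/\sqrt{n}}_{=: i_a} + \underbrace{m'\mathcal{M}_{\hat I}\zeta/\sqrt{n}}_{=: i_b} + \underbrace{V'\mathcal{M}_{\hat I}g/\sqrt{n}}_{=: i_c} - \underbrace{V'\mathcal{P}_{\hat I}\zeta/\sqrt{n}}_{=: i_d}. $$

First, by Condition HLMS we have $\|\mathcal{M}_{\hat I}g\| = o_P(n^{1/4})$ and $\|\mathcal{M}_{\hat I}m\| = o_P(n^{1/4})$. Therefore

$$ |i_a| = |m'\mathcal{M}_{\hat I}g/\sqrt{n}| \le \sqrt{n}\,\|\mathcal{M}_{\hat I}g/\sqrt{n}\|\,\|\mathcal{M}_{\hat I}m/\sqrt{n}\| \lesssim_P o(1). $$

Second, using that $m = X\beta_{m0} + R_m$ and $m'\mathcal{M}_{\hat I}\zeta = R_m'\zeta - (\tilde\beta_m(\hat I) - \beta_{m0})'X'\zeta$, we have

$$ |i_b| \le |R_m'\zeta/\sqrt{n}| + |(\tilde\beta_m(\hat I) - \beta_{m0})'X'\zeta/\sqrt{n}| $$
$$ \le |R_m'\zeta/\sqrt{n}| + \|\tilde\beta_m(\hat I) - \beta_{m0}\|_1\,\|X'\zeta/\sqrt{n}\|_\infty $$
$$ \lesssim_P \sqrt{s/n} + \sqrt{s}\,\{o(n^{-1/4}) + \sqrt{s/n}\}\sqrt{\log(p \vee n)} = o(1). $$

This follows because

$$ |R_m'\zeta/\sqrt{n}| \lesssim_P \sqrt{R_m'R_m/n} \lesssim_P \sqrt{s/n}, $$

by Chebyshev inequality and Conditions SM and ASTE(iii),

$$ \|\tilde\beta_m(\hat I) - \beta_{m0}\|_1 \le \sqrt{\hat s + s}\,\|\tilde\beta_m(\hat I) - \beta_{m0}\| \lesssim_P \sqrt{s}\,\{o(n^{-1/4}) + \sqrt{s/n}\}, $$

by Step 4 and $\hat s = |\hat I| \lesssim_P s$ by Condition HLMS, and

$$ \|X'\zeta/\sqrt{n}\|_\infty \lesssim_P \sqrt{\log(p \vee n)} $$

holding by Step 4 in the proof of Theorem 1.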

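Third, using similar reasoning and the decomposition $g = X\beta_{g0} + R_g$, conclude

$$ |i_c| \le |R_g'V/\sqrt{n}| + |(\tilde\beta_g(\hat I) - \beta_{g0})'X'V/\sqrt{n}| \lesssim_P \sqrt{s/n} + \sqrt{s}\,\{o(n^{-1/4}) + \sqrt{s/n}\}\sqrt{\log(p \vee n)} = o_P(1). $$

Fourth, we have

$$ |i_d| \le |\tilde\beta_V(\hat I)'X'\zeta/\sqrt{n}| \le \|\tilde\beta_V(\hat I)\|_1\,\|X'\zeta/\sqrt{n}\|_\infty \lesssim_P \sqrt{[s\log(p \vee n)]^2/n} = o_P(1), $$

since $\|X'\zeta/\sqrt{n}\|_\infty \lesssim_P \sqrt{\log(p \vee n)}$ by Step 4 of the proof of Theorem 1, and

$$ \|\tilde\beta_V(\hat I)\|_1 \le \sqrt{\hat s}\,\|\tilde\beta_V(\hat I)\| \le \sqrt{\hat s}\,\|(X[\hat I]'X[\hat I]/n)^{-1}X[\hat I]'V/n\|. $$

The latter bound follows from $\hat s \lesssim_P s$ by Condition HLMS so that $\phi_{\min}^{-1}(\hat s) \lesssim_P 1$ by Condition SE, and again invoking Step 4 of the proof of Theorem 1 to establish $\|X'V/\sqrt{n}\|_\infty \lesssim_P \sqrt{\log(p \vee n)}$.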

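Step 3. (Behavior of ii.) Decompose

$$ ii = (m + V)'\mathcal{M}_{\hat I}(m + V)/n = V'V/n + \underbrace{m'\mathcal{M}_{\hat I}m/n}_{=: ii_a} + \underbrace{2m'\mathcal{M}_{\hat I}V/n}_{=: ii_b} - \underbrace{V'\mathcal{P}_{\hat I}V/n}_{=: ii_c}. $$

Then $|ii_a| \lesssim_P o(n^{1/2})/n = o_P(n^{-1/2})$ by Condition HLMS, $|ii_b| = o_P(n^{-1/2})$ by reasoning similar to deriving the bound for $|i_b|$, and $|ii_c| \lesssim_P [s\log(p \vee n)]/n = o_P(1)$ by reasoning similar to deriving the bound for $|i_d|$.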

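Step 4. (Auxiliary: Bounds on $\|\tilde\beta_m(\hat I) - \beta_{m0}\|$ and $\|\tilde\beta_g(\hat I) - \beta_{g0}\|$.) To establish a bound on $\|\tilde\beta_g(\hat I) - \beta_{g0}\|$ note that

$$ \|\mathcal{M}_{\hat I}g/\sqrt{n}\| \ge \left|\,\|X(\tilde\beta_g(\hat I) - \beta_{g0})/\sqrt{n}\| - \|R_g/\sqrt{n}\|\,\right| $$

where $\|R_g/\sqrt{n}\| \lesssim_P \sqrt{s/n}$ holds by Chebyshev inequality and Condition ASTE(iii). Moreover, by Condition HLMS we have $\|\mathcal{M}_{\hat I}g/\sqrt{n}\| = o_P(n^{-1/4})$ and $\hat s = |\hat I| \lesssim_P s$. Thus

$$ o(n^{-1/4}) + \sqrt{s/n} \gtrsim_P \|X(\tilde\beta_g(\hat I) - \beta_{g0})/\sqrt{n}\| \ge \sqrt{\phi_{\min}(\hat s + s)}\,\|\tilde\beta_g(\hat I) - \beta_{g0}\| \gtrsim_P \|\tilde\beta_g(\hat I) - \beta_{g0}\| $$

since $\sqrt{\phi_{\min}(\hat s + s)} \gtrsim_P 1$ by Condition SE.

The same logic yields $\|\tilde\beta_m(\hat I) - \beta_{m0}\| \lesssim_P \sqrt{s/n} + o(n^{-1/4})$.

Step 5. (Variance Estimation.) It follows similarly to Step 7 in the proof of Theorem 1 but using Condition HLMS instead of Lemma 1.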

□ 



Appendix E. Proof of Corollary 2 



The proof is similar to the proof of Corollary 1. 



Appendix F. Verification of Conditions for the Examples 



F.1. Verification for Example 1. Let $\mathbf{P}$ be the collection of all regression models $P$ that obey the conditions set forth above for all $n$ for the given constants $(p, b, B, q_x, q)$. Below we provide explicit bounds for $\kappa'$, $\kappa''$, $c$, $C$, $\delta_n$ and $\Delta_n$ that appear in Conditions ASTE, SE and SM that depend only on $(p, b, B, q_x, q)$ and $n$, which in turn establish these conditions for any $P \in \mathbf{P}$.

Condition ASTE(i) is assumed. Condition ASTE(ii) holds with $\|\alpha_0\| \le C_1^{ASTE} = B$. Condition ASTE(iii) holds with $s = p$ and $r_{gi} = r_{mi} = 0$.

Condition ASTE(iv) holds with $\delta_{1n}^{ASTE} := p^2\log^2(p \vee n)/n \to 0$ since $s = p$ is fixed. Finally, we verify ASTE(v). Because $\tilde v_i = v_i$, $\tilde\zeta_i = \zeta_i$ and the moment condition $E[|v_i^q|] + E[|\zeta_i^q|] \le C_2^{ASTE} = 2B$ with $q > 4$, the first two requirements follow. To show the last requirement, note that because $E[\|x_i\|^{q_x}] \le B$ we have

$$ P\left(\max_{i \le n}\|x_i\|_\infty > t_{1n}\right) \le P\left(\Big(\sum_{i=1}^n\|x_i\|^{q_x}\Big)^{1/q_x} > t_{1n}\right) \le nE[\|x_i\|^{q_x}]/t_{1n}^{q_x} \le nB/t_{1n}^{q_x} =: \Delta_{1n}^{ASTE}. \quad \text{(F.45)} $$

Let $t_{1n} = (n\log n)^{1/q_x}B^{1/q_x}$ so that $\Delta_{1n}^{ASTE} = 1/\log n$. Thus we have with probability $1 - \Delta_{1n}^{ASTE}$

$$ \max_{i \le n}\|x_i\|_\infty^2\,s\,n^{-1/2 + 2/q} \le (n\log n)^{2/q_x}B^{2/q_x}\,p\,n^{-1/2 + 2/q} =: \delta_{2n}^{ASTE}. $$

It follows that $\delta_{2n}^{ASTE} \to 0$ by the assumption that $4/q_x + 4/q < 1$.
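To verify Condition SE note that

$$ P\left(\|E_n[x_i x_i'] - E[x_i x_i']\| > t_{2n}\right) \le \sum_{k=1}^p\sum_{j=1}^p\frac{E[x_{ij}^2 x_{ik}^2]}{n t_{2n}^2} \le \frac{pE[\|x_i\|^4]}{n t_{2n}^2} \le \frac{pB^{4/q_x}}{n t_{2n}^2} =: \Delta_{1n}^{SE}. $$

Setting $t_{2n} := b/2$ we have $\Delta_{1n}^{SE} = (2/b)^2B^{4/q_x}p/n \to 0$ since $p$ is fixed. Then, with probability $1 - \Delta_{1n}^{SE}$ we have

$$ \lambda_{\min}(E_n[x_i x_i']) \ge \lambda_{\min}(E[x_i x_i']) - \|E_n[x_i x_i'] - E[x_i x_i']\| \ge b/2 =: \kappa', $$
$$ \lambda_{\max}(E_n[x_i x_i']) \le \lambda_{\max}(E[x_i x_i']) + \|E_n[x_i x_i'] - E[x_i x_i']\| \le E[\|x_i\|^2] + b/2 \le 2B^{2/q_x} =: \kappa''. $$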



In the verification of Condition SM note that the second and third requirements in Condition SM(i) hold with $c_1^{SM} = b$ and $C_1^{SM} = B^{2/q}$. Condition SM(iii) holds with $\delta_{1n}^{SM} := \log^3 p/n \to 0$ since $p$ is fixed.

The first requirement in Condition SM(i) and Condition SM(ii) hold by the stated moment assumptions, for $\epsilon_i = v_i$ and $\epsilon_i = \zeta_i$, $\tilde y_i = d_i$ and $\tilde y_i = y_i$,

$$ E[|\epsilon_i^q|] \le B =: A_1 $$
$$ E[|d_i^q|] \le 2^{q-1}E[|x_i'\beta_{m0}|^q] + 2^{q-1}E[|v_i^q|] \le 2^{q-1}E[\|x_i\|^q]\|\beta_{m0}\|^q + 2^{q-1}E[|v_i^q|] \le 2^{q-1}(B^{q/q_x}B^q + B) =: A_2 $$
$$ E[d_i^4] \le 2^3(B^{4/q_x}B^4 + B) =: A_2' $$
$$ E[y_i^4] \le 3^3\|\alpha_0\|^4E[d_i^4] + 3^3\|\beta_{g0}\|^4E[\|x_i\|^4] + 3^3E[\zeta_i^4] \le 3^3B^4 2^3A_2' + 3^3B^4B^{4/q_x} + 3^3B^{4/q} =: A_3 $$
$$ \max_{1 \le j \le p}E[x_{ij}^2\tilde y_i^2] \le \max_{1 \le j \le p}(E[x_{ij}^4])^{1/2}(E[\tilde y_i^4])^{1/2} \le B^{2/q_x}(E[\tilde y_i^4])^{1/2} \le B^{2/q_x}(A_2' \vee A_3)^{1/2} =: A_4 $$
$$ \max_{1 \le j \le p}E[|x_{ij}\epsilon_i|^3] = \max_{1 \le j \le p}E[|x_{ij}^3|E[|\epsilon_i^3| \mid x_i]] \le B^{3/q}\max_{1 \le j \le p}E[|x_{ij}^3|] \le B^{3/q + 3/q_x} =: A_5 $$
$$ \max_{1 \le j \le p}1/E[x_{ij}^2] \le 1/\lambda_{\min}(E[x_i x_i']) \le 1/b =: A_6 $$

since $4 < q \le q_x$. Thus these conditions hold with $C_2^{SM} = A_2 \vee (A_1 + (A_2' \vee A_3)^{1/2} + A_4 + A_5 + A_6)$.

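Next we show Condition SM(iv). By (F.45) we have $\max_{i \le n}\|x_i\|_\infty^2 \le (n\log n)^{2/q_x}B^{2/q_x}$ with probability $1 - \Delta_{1n}^{ASTE}$, thus with the same probability

$$ \max_{i \le n}\|x_i\|_\infty^2\,\frac{s\log(n \vee p)}{n} \le (B\log n)^{2/q_x}\,\frac{n^{2/q_x}p\log(p \vee n)}{n} =: \delta_{2n}^{SM} \to 0 $$

since $q_x > 4$ and $s = p$ is fixed.

Next for $\epsilon_i = v_i$ and $\epsilon_i = \zeta_i$ we have

$$ P\left(\max_{j \le p}|(E_n - E)[x_{ij}^2\epsilon_i^2]| > \delta_{3n}^{SM}\right) \le \sum_{j=1}^p\frac{E[x_{ij}^4\epsilon_i^4]}{n(\delta_{3n}^{SM})^2} \le \frac{pB^{4/q + 4/q_x}}{n(\delta_{3n}^{SM})^2} =: \Delta_{1n}^{SM} $$

by the union bound, Chebyshev inequality and by $E[x_{ij}^4\epsilon_i^4] = E[x_{ij}^4E[\epsilon_i^4 \mid x_i]] \le B^{4/q + 4/q_x}$. Letting $\delta_{3n}^{SM} = B^{2/q + 2/q_x}n^{-1/4} \to 0$ we have $\Delta_{1n}^{SM} = p/n^{1/2} \to 0$ since $p$, $B$, $q$ and $q_x$ are fixed.

Next for $\tilde y_i = d_i$ and $\tilde y_i = y_i$ we have

$$ P\left(\max_{j \le p}|(E_n - E)[x_{ij}^2\tilde y_i^2]| > \delta_{4n}^{SM}\right) \le \sum_{j=1}^p\frac{E[x_{ij}^4\tilde y_i^4]}{n(\delta_{4n}^{SM})^2} \le \frac{pB^{4/q_x}A_8^{4/q}}{n(\delta_{4n}^{SM})^2} =: \Delta_{2n}^{SM} $$

by the union bound, Chebyshev inequality and by

$$ E[x_{ij}^4\tilde y_i^4] \le E[x_{ij}^{\tilde q}]^{4/\tilde q}E[\tilde y_i^q]^{4/q} \le E[x_{ij}^{q_x}]^{4/q_x}E[\tilde y_i^q]^{4/q} \le B^{4/q_x}A_8^{4/q} $$

holding by Hölder inequality where $4 < \tilde q \le q_x$ is such that $4/\tilde q + 4/q = 1$, and

$$ E[\tilde y_i^q] \le (1 + 3^{q-1}\|\alpha_0\|^q)E[d_i^q] + 3^{q-1}\|\beta_{g0}\|^qE[\|x_i\|^q] + 3^{q-1}E[\zeta_i^q] \le 3^q(A_2 + B^qA_2 + B^qB^{q/q_x} + B) =: A_8. $$

Letting $\delta_{4n}^{SM} = B^{2/q_x}A_8^{2/q}n^{-1/4} \to 0$ we have $\Delta_{2n}^{SM} = p/n^{1/2} \to 0$ since $p$, $B$, $q$ and $q_x$ are fixed.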

Finally, we set c = cf M , C = max{Cf STE , C 2 A5T£ , C? M ,C$ M }, S n = maxj^™, < STE , 
€f + €f + €f } 0, and A n = max{A£f™ + Af r f + Aff , Af r f } 0. □ 

We will make use of the following technical lemma in the verification of Examples 2, 3, and 4.

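Lemma 2 (Uniform Approximation). Let $h_i = x_i'\theta_h + \rho_i$ be a function whose coefficients $\theta_h \in S_A^a(p)$, and $\kappa \le \lambda_{\min}(E[x_i x_i']) \le \lambda_{\max}(E[x_i x_i']) \le \bar\kappa$. For $s = A^{1/a}n^{1/2a}$, $a > 1$, define $\beta_{h0}$ as in (5.29), $r_{hi} = h_i - x_i'\beta_{h0}$, for $i = 1, \ldots, n$. Then we have

$$ |r_{hi}| \le \|x_i\|_\infty(\bar\kappa/\kappa)^{3/2}\left\{\frac{2a - 1}{a - 1}\sqrt{s^2/n} + 5\sqrt{sE[\rho_i^2]/\kappa}\right\} + |\rho_i|. $$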

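Proof. Let $T_h$ denote the support of $\beta_{h0}$ and $S$ denote the support of the $s$ largest components of $\theta_h$. Note that $|T_h| = |S| = s$. First we establish some auxiliary bounds on $\|\theta_h[T_h^c]\|$ and $\|\theta_h[T_h^c]\|_1$. By the optimality of $T_h$ and $\beta_{h0}$ we have that

$$ \sqrt{E[(h_i - x_i'\beta_{h0})^2]} \le \sqrt{E[(x_i[S^c]'\theta_h[S^c] + \rho_i)^2]} \le \sqrt{\bar\kappa}\,\|\theta_h[S^c]\| + \sqrt{E[\rho_i^2]} \quad \text{and} $$
$$ \sqrt{E[(h_i - x_i'\beta_{h0})^2]} = \sqrt{E[\{x_i'(\theta_h - \beta_{h0}) + \rho_i\}^2]} \ge \sqrt{\kappa}\,\|\theta_h[T_h^c]\| - \sqrt{E[\rho_i^2]}. $$

Thus we have $\|\theta_h[T_h^c]\| \le \sqrt{\bar\kappa/\kappa}\,\|\theta_h[S^c]\| + 2\sqrt{E[\rho_i^2]/\kappa}$. Moreover, since $\theta_h \in S_A^a(p)$, we have

$$ \|\theta_h[S^c]\|^2 = \sum_{j=s+1}^\infty\theta_{h(j)}^2 \le A^2\sum_{j=s+1}^\infty j^{-2a} \le A^2s^{-2a+1}/[2a - 1] \le A^2s^{-2a+1} $$

since $a > 1$. Combining these relations we have

$$ \|\theta_h[T_h^c]\| \le \sqrt{\bar\kappa/\kappa}\,As^{-a+1/2} + 2\sqrt{E[\rho_i^2]/\kappa} = \sqrt{\bar\kappa/\kappa}\,\sqrt{s/n} + 2\sqrt{E[\rho_i^2]/\kappa}. $$

The second bound follows by observing that

$$ \|\theta_h[T_h^c]\|_1 \le \sqrt{s}\,\|\theta_h[T_h^c \cap S]\| + \|\theta_h[S^c]\|_1 \le \sqrt{s}\,\|\theta_h[T_h^c]\| + As^{-a+1}/[a - 1] $$
$$ \le \sqrt{s^2/n}\,\sqrt{\bar\kappa/\kappa} + 2\sqrt{sE[\rho_i^2]/\kappa} + (s/\sqrt{n})/[a - 1] $$
$$ \le \sqrt{s^2/n}\,\sqrt{\bar\kappa/\kappa}\;a/[a - 1] + 2\sqrt{sE[\rho_i^2]/\kappa}. $$

By the first-order optimality condition of the problem (5.29) that defines $\beta_{h0}$, we have

$$ E[x_i[T_h]x_i[T_h]'](\beta_{h0}[T_h] - \theta_h[T_h]) = E[x_i[T_h]x_i[T_h^c]']\theta_h[T_h^c] + E[x_i[T_h]\rho_i]. $$

Thus, since $\|E[x_i[T_h]\rho_i]\| = \sup_{\|\eta\| = 1}E[\eta'x_i[T_h]\rho_i] \le \sup_{\|\eta\| = 1}\sqrt{E[(\eta'x_i[T_h])^2]}\sqrt{E[\rho_i^2]} \le \sqrt{\bar\kappa E[\rho_i^2]}$, we have

$$ \kappa\|\beta_{h0} - \theta_h[T_h]\| \le \bar\kappa\|\theta_h[T_h^c]\| + \sqrt{\bar\kappa E[\rho_i^2]} \le \bar\kappa\sqrt{\bar\kappa/\kappa}\,\sqrt{s/n} + 2\bar\kappa\sqrt{E[\rho_i^2]/\kappa} + \sqrt{\bar\kappa E[\rho_i^2]} $$

where the last inequality follows from the definition of $s = A^{1/a}n^{1/2a}$. Therefore

$$ |r_{hi}| = |h_i - x_i'\beta_{h0}| \le |x_i'(\theta_h - \beta_{h0})| + |\rho_i| $$
$$ \le \|x_i\|_\infty\|\theta_h - \beta_{h0}\|_1 + |\rho_i| $$
$$ \le \sqrt{s}\,\|x_i\|_\infty\|\theta_h[T_h] - \beta_{h0}\| + \|x_i\|_\infty\|\theta_h[T_h^c]\|_1 + |\rho_i| $$
$$ \le \|x_i\|_\infty\left\{\sqrt{s^2/n}\,(\bar\kappa/\kappa)^{3/2} + \sqrt{sE[\rho_i^2]/\kappa}\,\sqrt{\bar\kappa/\kappa}\,\{1 + 2\sqrt{\bar\kappa/\kappa}\}\right\} $$
$$ \quad + \|x_i\|_\infty\left(\sqrt{s^2/n}\,\sqrt{\bar\kappa/\kappa}\;a/[a - 1] + 2\sqrt{sE[\rho_i^2]/\kappa}\right) + |\rho_i| $$
$$ \le \|x_i\|_\infty(\bar\kappa/\kappa)^{3/2}\left\{\frac{2a - 1}{a - 1}\sqrt{s^2/n} + 5\sqrt{sE[\rho_i^2]/\kappa}\right\} + |\rho_i|. $$

□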

F.2. Verification for Example 2. Let $\mathbf{P}$ be the collection of all regression models $P$ that obey the conditions set forth above for all $n$ for the given constants $(\kappa, \bar\kappa, a, A, B, \chi)$ and sequences $p_n$ and $\delta_n$. Below we provide explicit bounds for $\kappa'$, $\kappa''$, $c$, $C$, $\delta_n$ and $\Delta_n$ that appear in Conditions ASTE, SE and SM that depend only on $(\kappa, \bar\kappa, a, A, B, \chi)$, $p$, $\delta_n$ and $n$, which in turn establish these conditions for any $P \in \mathbf{P}$. In what follows we exploit Gaussianity of $w_i$ and use that $(E[|\eta'w_i|^k])^{1/k} \le G_k(E[|\eta'w_i|^2])^{1/2}$ for any vector $\eta$, $\|\eta\| < \infty$, where the constant $G_k$ depends on $k$ only.

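Condition ASTE(i) is assumed. Condition ASTE(ii) holds with $\|\alpha_0\| \le B =: C_1^{ASTE}$. Because $\theta_m, \theta_g \in S_A^a(p)$, Condition ASTE(iii) holds with

$$ s = A^{1/a}n^{1/2a}, \quad r_{mi} = m(z_i) - \sum_{j=1}^p z_{ij}\beta_{m0j}, \quad \text{and} \quad r_{gi} = g(z_i) - \sum_{j=1}^p z_{ij}\beta_{g0j} $$

where $\|\beta_{m0}\|_0 \le s$ and $\|\beta_{g0}\|_0 \le s$. Indeed, we have

$$ E[r_{mi}^2] \le E\left[\Big(\sum_{j \ge s+1}\theta_{m(j)}z_{i(j)}\Big)^2\right] \le \bar\kappa\sum_{j \ge s+1}\theta_{m(j)}^2 \le \bar\kappa A^2s^{-2a+1}/[2a - 1] \le \bar\kappa s/n $$

where the first inequality follows by the definition of $\beta_{m0}$ in (5.29), the second inequality follows from the eigenvalue bound $\lambda_{\max}(E[w_i w_i']) \le \bar\kappa$, the third from $\theta_m \in S_A^a(p)$, and the last inequality because $s = A^{1/a}n^{1/2a}$. Similarly we have $E[r_{gi}^2] \le E[(\sum_{j \ge s+1}\theta_{g(j)}z_{i(j)})^2] \le \bar\kappa A^2s^{-2a+1}/[2a - 1] \le \bar\kappa s/n$. Thus let $C_2^{ASTE} := \sqrt{\bar\kappa}$.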
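The two arithmetic facts used in the last two inequalities can be checked numerically: the tail bound $\sum_{j > s}j^{-2a} \le s^{-2a+1}/(2a - 1)$ and the identity $A^2s^{-2a+1} = s/n$ implied by $s = A^{1/a}n^{1/2a}$. The sketch below is only an illustration; it treats $s$ as an integer, truncates the infinite sum at a finite number of terms, and uses hypothetical function names.

```python
import math

def sparsity_index(A, a, n):
    """s = A^(1/a) * n^(1/(2a)), as in the verification of ASTE(iii)."""
    return A ** (1 / a) * n ** (1 / (2 * a))

def tail_sum(s_int, a, terms=200000):
    """Truncated version of sum_{j > s} j^(-2a) for an integer s."""
    return sum(j ** (-2 * a) for j in range(s_int + 1, s_int + 1 + terms))

def tail_bound(s, a):
    """Integral bound s^(1 - 2a) / (2a - 1) on the tail sum."""
    return s ** (1 - 2 * a) / (2 * a - 1)
```

For example, with $A = 1$, $a = 2$ and $n = 10^4$ one gets $s = 10$, so that $A^2s^{-3} = 10^{-3} = s/n$.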

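Condition ASTE(iv) holds with $\delta_{1n}^{ASTE} := A^{2/a}n^{1/a - 1}\log^2(p \vee n) \to 0$ since $s = A^{1/a}n^{1/2a}$, $A$ is fixed, and the assumed condition $n^{(1-a)/a}\log^2(p \vee n)\log^2 n \le \delta_n \to 0$.

The moment restrictions in Condition ASTE(v) are satisfied by the Gaussianity. Indeed, we have for $q = 4/\chi$ (where $\chi < 1$ by assumption)

$$ E[|\tilde\zeta_i^q|] \le 2^{q-1}E[|\zeta_i^q|] + 2^{q-1}E[|r_{gi}^q|] \le 2^{q-1}G_q^q\left(E[\zeta_i^2]^{q/2} + E[r_{gi}^2]^{q/2}\right) $$
$$ \le 2^{q-1}G_q^q\{\bar\kappa^{q/2} + \bar\kappa^{q/2}(s/n)^{q/2}\} \le 2^qG_q^q\bar\kappa^{q/2} =: C_3^{ASTE} $$

for $s \le n$, i.e., $n \ge n_{01}^{ASTE} := A^{2/[2a-1]}$. Similarly, $E[|\tilde v_i|^q] \le C_3^{ASTE}$. Moreover,

$$ |E[\tilde\zeta_i^2\tilde v_i^2] - E[\zeta_i^2v_i^2]| \le E[\zeta_i^2r_{mi}^2] + E[r_{gi}^2v_i^2] + E[r_{gi}^2r_{mi}^2] $$
$$ \le \sqrt{E[\zeta_i^4]E[r_{mi}^4]} + \sqrt{E[r_{gi}^4]E[v_i^4]} + \sqrt{E[r_{gi}^4]E[r_{mi}^4]} $$
$$ \le G_4^2\bar\kappa E[r_{mi}^2] + G_4^2\bar\kappa E[r_{gi}^2] + G_4^2E[r_{gi}^2]E[r_{mi}^2] $$
$$ \le G_4^2\bar\kappa^2\{2 + \bar\kappa s/n\}s/n =: \delta_{2n}^{ASTE} \to 0. $$

Next note that by Gaussian tail bounds and $\lambda_{\max}(E[w_i w_i']) \le \bar\kappa$ we have

$$ \max_{i \le n}\|x_i\|_\infty \le \|E[x_i]\|_\infty + \max_{i \le n}\|x_i - E[x_i]\|_\infty \le \sqrt{\bar\kappa} + \sqrt{2\bar\kappa\log(pn)} \quad \text{with probability at least } 1 - \Delta_{1n}^{ASTE} \quad \text{(F.46)} $$

where $\Delta_{1n}^{ASTE} = 1/\sqrt{2\bar\kappa\log(pn)}$. The last requirement in Condition ASTE(v) holds with $q = 4/\chi$:

$$ \max_{i \le n}\|x_i\|_\infty^2\,s\,n^{-1/2 + 2/q} \le 6\bar\kappa\log(pn)\,A^{1/a}n^{\frac{1}{2a} - \frac{1}{2} + \frac{\chi}{2}} =: \delta_{3n}^{ASTE} $$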



with probability $1 - \Delta_{1n}^{ASTE}$. By the assumption on $a$, $p$, $\chi$, and $n$, $\delta_{3n}^{ASTE} \to 0$.

To verify Condition SE with $\ell_n = \log n$ note that the minimal and maximal eigenvalues of $E[x_i x_i']$ are bounded away from zero by $\kappa > 0$ and from above by $\bar\kappa < \infty$ uniformly in $n$. Also, let $\mu = E[x_i]$ so that $x_i = \tilde x_i + \mu$ where $\tilde x_i$ is zero mean. By construction $E[x_i x_i'] = E[\tilde x_i\tilde x_i'] + \mu\mu'$ and $\|\mu\| \le \sqrt{\bar\kappa}$.

For any $\eta \in \mathbb{R}^p$, $\|\eta\|_0 \le k := s\log n$ and $\|\eta\| = 1$, we have that

$$ E_n[(\eta'x_i)^2] - E[(\eta'x_i)^2] = E_n[(\eta'\tilde x_i)^2] - E[(\eta'\tilde x_i)^2] + 2\eta'E_n[\tilde x_i] \cdot \eta'\mu. $$

Moreover, by Gaussianity of $\tilde x_i$, with probability $1 - \Delta_{1n}^{SE}$, where $\Delta_{1n}^{SE} = 1/\sqrt{2\bar\kappa\log(pn)}$,

$$ |\eta'E_n[\tilde x_i]| \le \|\eta\|_1\|E_n[\tilde x_i]\|_\infty \le \sqrt{k}\sqrt{2\bar\kappa\log(pn)}/\sqrt{n}, $$
$$ |\eta'\mu| \le \|\eta\|\,\|\mu\| \le \sqrt{\bar\kappa}. $$

By the sub-Gaussianity of $\tilde x_i = (E[x_i x_i'] - \mu\mu')^{1/2}\Psi_i$, where $\Psi_i \sim N(0, I_p)$, by Theorem 3.2 in Rudelson and Zhou (2011) (restated in Lemma 10 in Appendix G) with $\tau = 1/6$, $k = s\log n$, $\alpha = \sqrt{8/3}$, provided that

$$ n \ge N_n := 80(\alpha^4/\tau^2)(s\log n)\log(12ep/[\tau s\log n]), $$

we have

$$ (1 - \tau)^2E[(\eta'\tilde x_i)^2] \le E_n[(\eta'\tilde x_i)^2] \le (1 + \tau)^2E[(\eta'\tilde x_i)^2] $$

with probability $1 - \Delta_{2n}^{SE}$, where $\Delta_{2n}^{SE} = 2\exp(-\tau^2n/80\alpha^4)$. Note that under ASTE(iv) we have $\Delta_{2n}^{SE} \to 0$ and

$$ \max\{n : n \le N_n\} \le \max\{(12e/\tau)^{2a}A^{-2},\ 80^2(\alpha^8/\tau^4)A^{2/a},\ n^*\} =: n_{01}^{SE} $$

where $n^*$ is the smallest $n$ such that $\delta_n < 1$.

Therefore, with probability $1 - \Delta_{2n}^{SE}$ and $n \ge n_{01}^{SE}$ we have for any $\eta \in \mathbb{R}^p$, $\|\eta\|_0 \le k$ and $\|\eta\| = 1$,

$$ E_n[(\eta'x_i)^2] \ge E[(\eta'x_i)^2] - |E_n[(\eta'x_i)^2] - E[(\eta'x_i)^2]| $$
$$ \ge E[(\eta'x_i)^2] - |E_n[(\eta'\tilde x_i)^2] - E[(\eta'\tilde x_i)^2]| - 2|\eta'E_n[\tilde x_i]| \cdot |\eta'\mu| $$
$$ \ge E[(\eta'x_i)^2]\{1 - 2\tau - \tau^2\} - 2\bar\kappa\sqrt{2k\log(pn)}/\sqrt{n} $$
$$ \ge E[(\eta'x_i)^2]/2 - 2\bar\kappa\sqrt{2k\log(pn)}/\sqrt{n} $$

since $\tau = 1/6$ and $E[(\eta'\tilde x_i)^2] \le E[(\eta'x_i)^2]$. So for $n \ge n_{02}^{SE} := 288k(\bar\kappa/\kappa)^2\log(pn)$ we have

$$ \phi_{\min}(s\log n)[E_n[x_i x_i']] \ge \kappa/3 =: \kappa'. $$

Similarly, we have

$$ E_n[(\eta'x_i)^2] \le E[(\eta'x_i)^2] + |E_n[(\eta'x_i)^2] - E[(\eta'x_i)^2]| $$
$$ \le E[(\eta'x_i)^2] + |E_n[(\eta'\tilde x_i)^2] - E[(\eta'\tilde x_i)^2]| + 2|\eta'E_n[\tilde x_i]| \cdot |\eta'\mu| $$
$$ \le E[(\eta'x_i)^2]\{1 + 2\tau + \tau^2\} + 2\bar\kappa\sqrt{2k\log(pn)}/\sqrt{n} $$
$$ \le 2E[(\eta'x_i)^2] + 2\bar\kappa\sqrt{2k\log(pn)}/\sqrt{n} $$

since $\tau = 1/6$ and $E[(\eta'\tilde x_i)^2] \le E[(\eta'x_i)^2]$. So for $n \ge n_{03}^{SE} := 2k\log(pn)$ we have

$$ \phi_{\max}(s\log n)[E_n[x_i x_i']] \le 4\bar\kappa =: \kappa''. $$

The second and third requirements in Condition SM(i) hold by the Gaussianity of $w_i$, $E[\zeta_i \mid x_i, v_i] = 0$, $E[v_i \mid x_i] = 0$, and the assumption that the minimal and maximal eigenvalues of the covariance matrix (operator) $E[w_i w_i']$ are bounded below and above by positive absolute constants.

The first requirement in Condition SM(i) and Condition SM(ii) also hold by Gaussianity. Indeed, we have for $\epsilon_i = v_i$ and $\epsilon_i = \zeta_i$, $\tilde y_i = d_i$ and $\tilde y_i = y_i$

E[|«?|] + E[|C?|] < 2^ 1 G^{(EK 2 ])«/ 2 + (E[C 2 ])^ 2 } < 2 q G%R q l 2 =: A 1 

E[\dj\] < 2«-iE[\e> m z\«] + 2^E[\v*\] < 2^GUE[\6' m z\ 2 ])^ + 2^G q (E[v 2 })^ 

< 2i- l G q q \\6 m \\iR ( il 2 + 2i~ l G q q R q / 2 2iG q q R q / 2 (l + (2A)«) =: R 2 
E[d 2 } ^ 2E[|0^| 2 ] + 2E[v 2 } < 2/t||^ m || 2 + 2ft ^ 2R(AA 2 + 1) =: A' 2 
E[y 2 ] < 3|a | 2 E[d 2 ] + 3E[|^z| 2 ] + 3E[C 2 ] S^ 2 ^ + 3^' 2 + 3« =: ,4 3 

max Wp E[4y 2 ] < max 1<Kp (E[4]) 1 /2( E [y|])i/2 ^ ^ max 1<Kp E[x 2 .]E[y 2 ] 
«C G^k(^ V A 3 ) =: A 4 
rnax^ Kp E[| W H < max 1<Kp (E[4]) 1 /2( E [ e f])V2 ^ max 1 ^ Kp (E[x 2 J .]) 3 / 2 (E[6 2 ]) 3 / 2 

< G^ 3 =: A 5 
maxi <Kp l/E[x 2 ] l/A min (E[wi^]) < 1/k =: A e 

because $\|\theta_m\| \le 2A$ and $\|\theta_g\| \le 2A$ since $\theta_m, \theta_g \in S_A^a(p)$. Thus the first requirement in Condition SM(i) holds with $C_1^{SM} = A_2$. Condition SM(ii) holds with $C_2^{SM} = A_1 + (A_2' \vee A_3) + A_4 + A_5 + A_6$. Condition SM(iii) is assumed.

To verify Condition SM(iv) note that for $\epsilon_i = v_i$ and $\epsilon_i = \zeta_i$, by (F.46), with probability $1 - \Delta_{1n}^{ASTE}$,



max i<p jE„[4e^] < max Kp ^[x?.] #E n [e?] 



< + V 2 ^!og(^)} max,-^ , 4 /E n [xf,] {/E n [ef ] 



(F.47)
By Lemma 3 with $k = 4$ we have with probability $1 - \Delta_{1n}^{SM}$, where $\Delta_{1n}^{SM} = 1/n$,

(F.48) 



max Kp yE n [xfj] < ||E[:Ej] ||oo + max^ p y / E n [(x^ - E[x^]) 4 ] 



< VR + v^2C + v / ^^" 1/ V 21 °g( 2 ^ n ) < 4C>/s 

for n ^ Roi = 41og 2 (2pn). Also, Lemma [3] with k = 8 and p = 1 we have with probability 
1 - Af* f that 



^En[ef] < 2R8C 2 + 2Kra~ 1/4 21og(2n) ^ 20C 2 R (FAQ) 
for n ^ ^-02^ = 161og 4 (2?i). Moreover, we have 



S&VWl < max ^E[x«]^E [e f] < GgK . 
Applying Lemma El for r = 2Af^ TE + Af*' 7 , with probability 1 — 8r we have 



[4 e ?]l < 4 V 21 ° g(Vr) . /Q(max En^ef], 1 - r) V 



i<P V n y j v n 

where by ffF~47]> . (lE48i) and (lE49l) we have 



Q(max 1<j<p JE n [xfrf] , 1 - r) < R 2 ^2 log{pn)80C 3 



So we let 6%*? = 640C 3 K 2 y log(2 f /r V fog(p^) V2 V / 2^|- -> under the condition that log 2 (pV 

Similarly for yi = and y^ = yi, by Lemma O we have with probability 1 — Af^ 1 , for 
n f we have 



E n [yf] ^\m]\ + v®nKvi-nvi\) s ] (F50) 

< [A' 2 V v4 3 ] 1/2 + (20C 2 E[y 2 ]) 1 /2 ^ qc[A' 2 V Ag] 1 ^. 

Moreover, ^E[yf] < G|E[y 2 ] < Gj{A' 2 V A 3 ). Therefore by Lemma E for r = 2A^ n STE + Aff , 
with probability 1 — 8r we have by the arguments in (|F.47|) . (|F.48|) . and (|F,50|) 

max|(E„ - E)[4-y 4 2 ]| < J 2 hg{2pM ^R log(^)4C v^(36C 2 [A> 2 V As]) V g^gl^ V ^ =: ^ 

where <5f^ - > under the condition log 2 (p Vn)/nO„ ->0. 
We have that the last term in Condition SM(iv) satisfies with probability 1 — A^ TE 



max \\x 



2 S l ° g{p Vn) -^6R \og{pn)A l / a n~ 1+1 ' 2a log(p V n) =: 8™. 



i oo 

n 



Under ASTE(iv) and $s = A^{1/a}n^{1/2a}$ the right-hand side above tends to $0$.
Finally, we set n = max{nfif TE ,nff ,n$f ,n$f ,n$£*}, C = max{Cf LSTE , C^ STE , 
2C£*™,C?", C$ M }, 5 n = maxj^e^C^e + €f + 4fl -+ 0, and A n = 
max{33A^ TE + 16Af n M , Af E } -> 0. 

□ 

Lemma 3. Let $f_{ij} \sim N(0, \sigma_j^2)$, $\sigma_j \le \sigma$, independent across $i = 1, \ldots, n$, where $j = 1, \ldots, p$. Then, for some universal constant $C \ge 1$, we have that for any $k \ge 2$ and $\gamma \in (0, 1)$

$$ P\left(\max_{1 \le j \le p}\{E_n[|f_{ij}^k|]\}^{1/k} > \sigma C\sqrt{k} + \sigma n^{-1/k}\sqrt{2\log(2p/\gamma)}\right) \le \gamma. $$

Proof. Note that $P(E_n[|f_{ij}^k|] > M) = P(\|f_{\cdot j}\|_k^k > Mn) = P(\|f_{\cdot j}\|_k > (Mn)^{1/k})$.

Since $|\,\|f\|_k - \|g\|_k\,| \le \|f - g\|_k \le \|f - g\|$ we have that $\|\cdot\|_k$ is 1-Lipschitz for $k \ge 2$. Moreover,

$$ E[\|f_{\cdot j}\|_k] \le (E[\|f_{\cdot j}\|_k^k])^{1/k} = \Big(\sum_{i=1}^nE[|f_{ij}^k|]\Big)^{1/k} = n^{1/k}(E[|f_{ij}^k|])^{1/k} $$
$$ = n^{1/k}\{\sigma_j^k2^{k/2}\Gamma((k + 1)/2)/\Gamma(1/2)\}^{1/k} \le n^{1/k}\sigma\sqrt{k}\,C. $$

By Ledoux and Talagrand (1991), page 21 equation (1.6), we have

$$ P(\|f_{\cdot j}\|_k > (Mn)^{1/k}) \le 2\exp\left(-\{(Mn)^{1/k} - E[\|f_{\cdot j}\|_k]\}^2/2\sigma_j^2\right). $$

Setting $M := \{\sigma C\sqrt{k} + \sigma n^{-1/k}\sqrt{2\log(2p/\gamma)}\}^k$, so that $(Mn)^{1/k} = n^{1/k}\sigma C\sqrt{k} + \sigma\sqrt{2\log(2p/\gamma)}$, we have by the union bound and $\sigma_j \le \sigma$

$$ P\left(\max_{1 \le j \le p}E_n[|f_{ij}^k|] > M\right) \le p\max_{1 \le j \le p}P(E_n[|f_{ij}^k|] > M) \le \gamma. $$

□
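The closed form for the absolute Gaussian moment used in the proof, $E[|f|^k] = \sigma^k2^{k/2}\Gamma((k + 1)/2)/\Gamma(1/2)$ for $f \sim N(0, \sigma^2)$, can be evaluated directly. The sketch below is an illustration with a hypothetical function name; it reproduces the familiar even moments $\sigma^2$, $3\sigma^4$ and $105\sigma^8$.

```python
import math

def abs_gaussian_moment(sigma, k):
    """E|f|^k for f ~ N(0, sigma^2): sigma^k * 2^(k/2) * Gamma((k+1)/2) / Gamma(1/2)."""
    return sigma ** k * 2 ** (k / 2) * math.gamma((k + 1) / 2) / math.gamma(0.5)
```

Taking the $k$-th root, `abs_gaussian_moment(1.0, k) ** (1 / k)` stays below `math.sqrt(k)` for the values of $k$ from 2 to 29 checked here, in line with the $\sigma\sqrt{k}$ growth rate used in the proof; this is a numerical observation, not a statement about the constant $C$ in the lemma.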



F.3. Verification for Example 3. Let $\mathbf{P}$ be the collection of all regression models $P$ that obey the conditions set forth above for all $n$ for the given constants $(\underline{f}, \bar f, a, A, b, B, q)$ and the sequence $\delta_n$. Below we provide explicit bounds for $\kappa'$, $\kappa''$, $c$, $C$, $\delta_n$ and $\Delta_n$ that appear in Conditions ASTE, SE and SM that depend only on $(\underline{f}, \bar f, a, A, b, B, q)$ and $\delta_n$, which in turn establish these conditions for all $P \in \mathbf{P}$.

Condition ASTE(i) is assumed. Condition ASTE(ii) holds with $\|\alpha_0\| \le B =: C_1^{ASTE}$. Because $\theta_m, \theta_g \in S_A^a(p)$, Condition ASTE(iii) holds with

$$ s = A^{1/a}n^{1/2a}, \quad r_{mi} = m(z_i) - \sum_{j=1}^p\beta_{m0j}P_j(z_i) \quad \text{and} \quad r_{gi} = g(z_i) - \sum_{j=1}^p\beta_{g0j}P_j(z_i) $$

where $\|\beta_{m0}\|_0 \le s$ and $\|\beta_{g0}\|_0 \le s$. Indeed, we have

$$ E[r_{mi}^2] \le E\left[\Big(\sum_{j \ge s+1}\theta_{m(j)}P_{(j)}(z_i)\Big)^2\right] \le \bar f\sum_{j \ge s+1}\theta_{m(j)}^2 \le \bar fA^2s^{-2a+1}/[2a - 1] \le \bar fs/n $$

where the first inequality follows by the definition of $\beta_{m0}$ in (5.29), the second inequality follows from the upper bound on the density and orthogonality of the basis, the third inequality follows from $\theta_m \in S_A^a(p)$, and the last inequality because $s = A^{1/a}n^{1/2a}$. Similarly we have $E[r_{gi}^2] \le E[(\sum_{j \ge s+1}\theta_{g(j)}P_{(j)}(z_i))^2] \le \bar fA^2s^{-2a+1}/[2a - 1] \le \bar fs/n$. Let $C_2^{ASTE} = \sqrt{\bar f}$.

Condition ASTE(iv) holds with $\delta_{1n}^{ASTE} := A^{2/a}n^{1/a - 1}\log^2(p \vee n) \to 0$ since $s = A^{1/a}n^{1/2a}$, $A$ is fixed, and the assumed condition $n^{(1-a)/a}\log^2(p \vee n) \le \delta_n \to 0$.

Next we establish the moment restrictions in Condition ASTE(v). Because $\underline{f} \le \lambda_{\min}(E[x_i x_i']) \le \lambda_{\max}(E[x_i x_i']) \le \bar f$, by the assumption on the density and orthonormal basis, and $\max_{i \le n}\|x_i\|_\infty \le B$, by Lemma 2 with $\rho_i = 0$ we have

$$ \max_{i \le n}|r_{mi}| \vee |r_{gi}| \le \max_{i \le n}\|x_i\|_\infty(\bar f/\underline{f})^{3/2}\frac{2a - 1}{a - 1}\sqrt{s^2/n} \le B(\bar f/\underline{f})^{3/2}\frac{2a - 1}{a - 1}\sqrt{s^2/n} =: \delta_{2n}^{ASTE} $$

where $\delta_{2n}^{ASTE} \to 0$ under $s = A^{1/a}n^{1/2a}$ and $a > 1$.

Thus we have

EIION < 2«- 1 E[|C?|] + 2i- l E[\r q gi \] < 2"~ 1 J B + 2^ 1 (< 5T£ ) 9 
^ 2?- 1 J B + 2«- 1 (^ TE )9 =: Cf TE . 

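Similarly, $E[|\tilde v_i|^q] \le C_3^{ASTE}$. Moreover, since $\delta_{2n}^{ASTE} \to 0$ we have

$$ |E[\tilde\zeta_i^2\tilde v_i^2] - E[\zeta_i^2v_i^2]| \le E[\zeta_i^2r_{mi}^2] + E[r_{gi}^2v_i^2] + E[r_{gi}^2r_{mi}^2] $$
$$ \le \sqrt{E[\zeta_i^4]E[r_{mi}^4]} + \sqrt{E[r_{gi}^4]E[v_i^4]} + \sqrt{E[r_{gi}^4]E[r_{mi}^4]} $$
$$ \le 2B^{2/q}(\delta_{2n}^{ASTE})^2 + (\delta_{2n}^{ASTE})^4 =: \delta_{3n}^{ASTE} \to 0. $$

Finally, the last requirement holds because $(1 - a)/a + 4/q < 0$ implies

$$ \max_{i \le n}\|x_i\|_\infty^2\,s\,n^{-1/2 + 2/q} \le B^2A^{1/a}n^{\frac{1}{2a} - \frac{1}{2} + \frac{2}{q}} =: \delta_{4n}^{ASTE} \to 0, $$

since $s = A^{1/a}n^{1/2a}$ and $\max_{i \le n}\|x_i\|_\infty \le B$.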

To show Condition SE with $\ell_n = \log n$ note that the regressors are uniformly bounded, and the minimal and maximal eigenvalues of $E[x_i x_i']$ are bounded below by $\underline{f}$ and above by $\bar f$ uniformly in $n$. Thus Condition SE follows by Corollary 4 in the supplementary material in Belloni and Chernozhukov (2011b) (restated in Lemma 9 in Appendix G), which is based on Rudelson and Vershynin (2008). Let

$$ \delta_{1n}^{SE} := 2CB\sqrt{s\log n}\,\log(1 + s\log n)\,\sqrt{\log(p \vee n)}\,\sqrt{\log n}/\sqrt{n} $$

and $\Delta_{1n}^{SE} := (2/\underline{f})(\delta_{1n}^{SE})^2 + \delta_{1n}^{SE}(2\bar f/\underline{f})$, where $C$ is a universal constant. By this result and the Markov inequality, we have with probability $1 - \Delta_{1n}^{SE}$

$$ \kappa' := \underline{f}/2 \le \phi_{\min}(s\log n)[E_n[x_i x_i']] \le \phi_{\max}(s\log n)[E_n[x_i x_i']] \le 2\bar f =: \kappa''. $$

We need to show that $\Delta_{1n}^{SE} \to 0$, which follows from $\delta_{1n}^{SE} \to 0$. We have that



, SE ^ 2CB(l + A) 2 V^log 2 (n)^/\og(pVn) 2 /nV 2 Mog 4 n /log(pVn) 



< i i & w v ^ = 2CB(1 + A)" 



'n ' V ra 2 / 3 V n 1 / 3 

Since by assumption $\log^3 p/n \le \delta_n \to 0$ and $a > 1$, we have $\delta_{1n}^{SE} \to 0$.

The second and third requirements in Condition SM(i) hold with $C_1^{SM} = B^{2/q}$ and $c_1^{SM} = b$ by assumption. Condition SM(iii) is assumed.

The first requirement in Condition SM(i) and Condition SM(ii) follow by, for $\epsilon_i = v_i$ and $\epsilon_i = \zeta_i$, $\tilde y_i = d_i$ and $\tilde y_i = y_i$,

$$ E[|v_i^q|] + E[|\zeta_i^q|] \le 2B =: A_1 $$
$$ E[|d_i^q|] \le 2^{q-1}E[|\theta_m'x_i|^q] + 2^{q-1}E[|v_i^q|] \le 2^{q-1}\|\theta_m\|_1^qE[\|x_i\|_\infty^q] + 2^{q-1}B \le 2^{q-1}(2A)^qB^q + 2^{q-1}B =: A_2 $$
$$ E[d_i^2] \le 2\bar f\|\theta_m\|^2 + 2E[v_i^2] \le 8\bar fA^2 + 2B^{2/q} =: A_2' $$
$$ E[y_i^2] \le 3|\alpha_0|^2E[d_i^2] + 3\|\theta_g\|_1^2E[\|x_i\|_\infty^2] + 3E[\zeta_i^2] \le 3B^2A_2' + 12A^2B^2 + 3B^{2/q} =: A_3 $$
$$ \max_{1 \le j \le p}E[x_{ij}^2\tilde y_i^2] \le B^2E[\tilde y_i^2] \le B^2(A_2' \vee A_3) =: A_4 $$
$$ \max_{1 \le j \le p}E[|x_{ij}\epsilon_i|^3] \le B^3E[|\epsilon_i^3|] \le B^3B^{3/q} =: A_5 $$
$$ \max_{1 \le j \le p}1/E[x_{ij}^2] \le 1/\lambda_{\min}(E[x_i x_i']) \le 1/\underline{f} =: A_6 $$

where we used that $\max_{i \le n}\|x_i\|_\infty \le B$, the moment assumptions on the disturbances, and $\|\theta_m\| \le \|\theta_m\|_1 \le 2A$, $\|\theta_g\|_1 \le 2A$ since $\theta_m, \theta_g \in S_A^a(p)$ for $a > 1$. Thus the first requirement in Condition SM(i) holds with $C_2^{SM} = A_2$. Condition SM(ii) holds with $C_3^{SM} := A_1 + (A_2' \vee A_3) + A_4 + A_5 + A_6$.

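To verify Condition SM(iv) note that for $\epsilon_i = v_i$ and $\epsilon_i = \zeta_i$ we have by Lemma 6 with
probability $1 - 8\tau$, where $\tau = 1/\log n$,

\[
\begin{aligned}
\max_{j\le p} |(\mathbb{E}_n - \mathrm{E})[x_{ij}^2\epsilon_i^2]|
&\le 4\sqrt{\frac{2\log(2p/\tau)}{n}}\; Q\Big(\max_{j\le p}\sqrt{\mathbb{E}_n[x_{ij}^4\epsilon_i^4]},\, 1-\tau\Big) \vee 2\max_{j\le p}\sqrt{\frac{2\mathrm{E}[x_{ij}^4\epsilon_i^4]}{n}} \\
&\le 4\sqrt{\frac{2\log(2p/\tau)}{n}}\; B^2\, Q\Big(\sqrt{\mathbb{E}_n[\epsilon_i^4]},\, 1-\tau\Big) \vee 2B^2\sqrt{\frac{2\mathrm{E}[\epsilon_i^4]}{n}} \\
&\le 4\sqrt{\frac{2\log(2p\log n)}{n}}\; B^2 B^{2/q}\sqrt{\log n} =: \delta_{1n}^{SM},
\end{aligned}
\]

where we used $\mathrm{E}[\epsilon_i^4] \le B^{4/q}$ and the Markov inequality. By the definition of $\tau$ and the assumed
rate $\log^3(p \vee n)/n \le \delta_n \to 0$, we have $\delta_{1n}^{SM} \to 0$.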
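Similarly, we have for $\tilde{y}_i = d_i$ and $\tilde{y}_i = y_i$, with probability $1 - 8\tau$,

\[
\begin{aligned}
\max_{j\le p} |(\mathbb{E}_n - \mathrm{E})[x_{ij}^2\tilde{y}_i^2]|
&\le 4\sqrt{\frac{2\log(2p/\tau)}{n}}\; Q\Big(\max_{j\le p}\sqrt{\mathbb{E}_n[x_{ij}^4\tilde{y}_i^4]},\, 1-\tau\Big) \vee 2\max_{j\le p}\sqrt{\frac{2\mathrm{E}[x_{ij}^4\tilde{y}_i^4]}{n}} \\
&\le 4\sqrt{\frac{2\log(2p/\tau)}{n}}\; B^2\, Q\Big(\sqrt{\mathbb{E}_n[\tilde{y}_i^4]},\, 1-\tau\Big) \vee 2B^2\sqrt{\frac{2\mathrm{E}[\tilde{y}_i^4]}{n}} \\
&\le 4\sqrt{\frac{2\log(2p\log n)}{n}}\; B^2\sqrt{A_7\log n} =: \delta_{2n}^{SM},
\end{aligned}
\]

where we used the Markov inequality and

\[
\begin{aligned}
\mathrm{E}[\tilde{y}_i^4] &\le \mathrm{E}[d_i^4] + 3^3|\alpha_0|^4\mathrm{E}[d_i^4] + 3^3\|\theta_g\|_1^4\, \mathrm{E}[\|x_i\|_\infty^4] + 3^3\mathrm{E}[\zeta_i^4] \\
&\le A_2^{4/q} + 3^3B^4A_2^{4/q} + 3^3(2A)^4B^4 + 3^3B^{4/q} =: A_7.
\end{aligned}
\]

By the definition of $\tau$ and the assumed rate $\log^3(p \vee n)/n \le \delta_n \to 0$, we have $\delta_{2n}^{SM} \to 0$.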

The last term in the requirement of Condition SM(iv), because $\max_{i\le n}\|x_i\|_\infty \le B$ and
Condition ASTE(iv) holds, is bounded by $\delta_{3n}^{SM} := B^2A^{1/a}n^{1/(2a)}\log(p \vee n)/n \to 0$.

Finally, we set c = cf M , C = max{C^ TS , C 2 ASTB , 2C 3 A5T£ , Cf M , Cf M , C 3 5A/ }, 5 n = 
max{4, 6^ TE , 6? n STE , 5^ TE ,6£ n STE , «f« + + $f } -> 0, A n = max{16/ log n, A?*} -> 
0. □ 



Appendix G. Tools 

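G.1. Moderate Deviations for a Maximum of Self-Normalized Averages. We shall be
using the following result, which is based on Theorem 7.4 in de la Peña, Lai, and Shao (2009).

Lemma 4 (Moderate Deviation Inequality for Maximum of a Vector). Suppose that

\[ S_j = \frac{\sum_{i=1}^n U_{ij}}{\sqrt{\sum_{i=1}^n U_{ij}^2}}, \]

where $U_{ij}$ are independent variables across $i$ with mean zero. We have that

\[ P\Big(\max_{1\le j\le p} |S_j| > \Phi^{-1}(1 - \gamma/2p)\Big) \le \gamma\Big(1 + \frac{A}{\ell_n^3}\Big), \]

where $A$ is an absolute constant, provided that for $\ell_n > 0$

\[ 0 \le \Phi^{-1}(1 - \gamma/(2p)) \le \frac{n^{1/6}}{\ell_n}\min_{1\le j\le p} M_j - 1, \qquad M_j := \frac{\big(\frac{1}{n}\sum_{i=1}^n \mathrm{E}[U_{ij}^2]\big)^{1/2}}{\big(\frac{1}{n}\sum_{i=1}^n \mathrm{E}[|U_{ij}|^3]\big)^{1/3}}. \]

The proof of this result, given in Belloni, Chen, Chernozhukov, and Hansen (2010), follows
from a simple combination of union bounds with the bounds in Theorem 7.4 in de la Peña,
Lai, and Shao (2009).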
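As an informal illustration of Lemma 4, the following sketch simulates self-normalized sums and records how often their maximum exceeds the Gaussian critical value $\Phi^{-1}(1 - \gamma/(2p))$. The function name `exceedance_frequency`, the heteroscedastic Gaussian design for $U_{ij}$, and all default parameter values are assumptions made here for illustration; they are not taken from the paper.

```python
import math
import random
from statistics import NormalDist


def exceedance_frequency(n=100, p=10, gamma=0.1, reps=500, seed=1):
    """Simulate how often max_j |S_j| exceeds Phi^{-1}(1 - gamma / (2p))."""
    rng = random.Random(seed)
    # Gaussian critical value used in Lemma 4.
    crit = NormalDist().inv_cdf(1.0 - gamma / (2.0 * p))
    exceed = 0
    for _ in range(reps):
        worst = 0.0
        for _ in range(p):
            # Mean-zero draws whose scale varies with i (an assumed design).
            u = [rng.gauss(0.0, 1.0 + (i % 3)) for i in range(n)]
            # Self-normalized sum S_j = sum_i U_ij / sqrt(sum_i U_ij^2).
            s_j = sum(u) / math.sqrt(sum(x * x for x in u))
            worst = max(worst, abs(s_j))
        if worst > crit:
            exceed += 1
    return exceed / reps, crit


if __name__ == "__main__":
    freq, crit = exceedance_frequency()
    print(f"critical value {crit:.4f}, exceedance frequency {freq:.3f}")
```

In this design the simulated frequency should be of the same order as $\gamma$, which is what the lemma suggests when the moment ratio condition holds; the sketch does not verify that condition.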
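G.2. Inequalities based on Symmetrization. Next we proceed to use symmetrization
arguments to bound the empirical process. In what follows, for a random variable $Z$ let $Q(Z, 1-\tau)$
denote its $(1-\tau)$-quantile.

Lemma 5 (Maximal inequality via symmetrization). Let $Z_1, \ldots, Z_n$ be arbitrary independent
stochastic processes and $\mathcal{F}$ a finite set of measurable functions. For any $\tau \in (0, 1/2)$, and
$\delta \in (0, 1)$ we have that with probability at least $1 - 4\tau - 4\delta$

\[ \max_{f\in\mathcal{F}} |\mathbb{G}_n(f(Z_i))| \le \Big\{4\sqrt{2\log(2|\mathcal{F}|/\delta)}\; Q\Big(\max_{f\in\mathcal{F}}\sqrt{\mathbb{E}_n[f(Z_i)^2]},\, 1-\tau\Big)\Big\} \vee 2\max_{f\in\mathcal{F}} Q\big(|\mathbb{G}_n(f(Z_i))|,\, 1/2\big). \]

Proof. Let

\[ e_{1n} = \sqrt{2\log(2|\mathcal{F}|/\delta)}\; Q\Big(\max_{f\in\mathcal{F}}\sqrt{\mathbb{E}_n[f(Z_i)^2]},\, 1-\tau\Big), \qquad e_{2n} = \max_{f\in\mathcal{F}} Q\big(|\mathbb{G}_n(f(Z_i))|,\, 1/2\big), \]

and the event $\mathcal{E} = \{\max_{f\in\mathcal{F}}\sqrt{\mathbb{E}_n[f^2(Z_i)]} \le Q(\max_{f\in\mathcal{F}}\sqrt{\mathbb{E}_n[f^2(Z_i)]},\, 1-\tau)\}$, which satisfies
$P(\mathcal{E}) \ge 1 - \tau$. By the symmetrization Lemma 2.3.7 of van der Vaart and Wellner (1996) (by
definition of $e_{2n}$ we have $\beta_n(x) \ge 1/2$ in Lemma 2.3.7) we obtain

\[
\begin{aligned}
P\{\max_{f\in\mathcal{F}} |\mathbb{G}_n(f(Z_i))| > 4e_{1n} \vee 2e_{2n}\} &\le 4P\{\max_{f\in\mathcal{F}} |\mathbb{G}_n(\varepsilon_i f(Z_i))| > e_{1n}\} \\
&\le 4P\{\max_{f\in\mathcal{F}} |\mathbb{G}_n(\varepsilon_i f(Z_i))| > e_{1n} \mid \mathcal{E}\} + 4\tau,
\end{aligned}
\]

where $\varepsilon_i$ are independent Rademacher random variables, $P(\varepsilon_i = 1) = P(\varepsilon_i = -1) = 1/2$.

Thus a union bound yields

\[ P\{\max_{f\in\mathcal{F}} |\mathbb{G}_n(f(Z_i))| > 4e_{1n} \vee 2e_{2n}\} \le 4\tau + 4|\mathcal{F}|\max_{f\in\mathcal{F}} P\{|\mathbb{G}_n(\varepsilon_i f(Z_i))| > e_{1n} \mid \mathcal{E}\}. \qquad \text{(G.51)} \]

We then condition on the values of $Z_1, \ldots, Z_n$ and $\mathcal{E}$, denoting the conditional probability
measure as $P_\varepsilon$. Conditional on $Z_1, \ldots, Z_n$, by the Hoeffding inequality the symmetrized process
$\mathbb{G}_n(\varepsilon_i f(Z_i))$ is sub-Gaussian for the $L^2(\mathbb{P}_n)$ norm, namely, for $f \in \mathcal{F}$, $P_\varepsilon\{|\mathbb{G}_n(\varepsilon_i f(Z_i))| > x\} \le
2\exp(-x^2/\{2\mathbb{E}_n[f^2(Z_i)]\})$. Hence, under the event $\mathcal{E}$, we can bound

\[
\begin{aligned}
P_\varepsilon\{|\mathbb{G}_n(\varepsilon_i f(Z_i))| > e_{1n} \mid Z_1, \ldots, Z_n, \mathcal{E}\} &\le 2\exp(-e_{1n}^2/[2\mathbb{E}_n[f^2(Z_i)]]) \\
&\le 2\exp(-\log(2|\mathcal{F}|/\delta)).
\end{aligned}
\]

Taking the expectation over $Z_1, \ldots, Z_n$ does not affect the right hand side bound. Plugging in
this bound yields the result. □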
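The final plug-in step of the proof of Lemma 5 can be spelled out as follows. This is only a worked restatement of the arithmetic implied by the displayed bounds, using the proof's notation ($e_{1n}$, the event $\mathcal{E}$, and the Rademacher variables $\varepsilon_i$).

```latex
% On the event \mathcal{E}, every f in \mathcal{F} satisfies
%   E_n[f^2(Z_i)] <= Q( max_f sqrt(E_n[f^2(Z_i)]), 1 - tau )^2,
% so by the definition of e_{1n}:
\frac{e_{1n}^2}{2\,\mathbb{E}_n[f^2(Z_i)]} \;\ge\; \log\!\big(2|\mathcal{F}|/\delta\big)
\quad\Longrightarrow\quad
2\exp\!\Big(-\frac{e_{1n}^2}{2\,\mathbb{E}_n[f^2(Z_i)]}\Big)
\;\le\; 2\exp\!\big(-\log(2|\mathcal{F}|/\delta)\big) \;=\; \frac{\delta}{|\mathcal{F}|}.

% Substituting into the union bound (G.51):
P\Big\{\max_{f\in\mathcal{F}} |\mathbb{G}_n(f(Z_i))| > 4e_{1n}\vee 2e_{2n}\Big\}
\;\le\; 4\tau + 4|\mathcal{F}|\cdot\frac{\delta}{|\mathcal{F}|}
\;=\; 4\tau + 4\delta .
```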

The following specialization will be convenient. 

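Lemma 6. Let $\tau \in (0, 1)$ and $\{(x_i', \epsilon_i)' \in \mathbb{R}^p \times \mathbb{R},\ i = 1, \ldots, n\}$ be random vectors that are
independent across $i$. Then with probability at least $1 - 8\tau$

\[ \max_{1\le j\le p} |\mathbb{E}_n[x_{ij}^2\epsilon_i^2] - \mathrm{E}[x_{ij}^2\epsilon_i^2]| \le 4\sqrt{\frac{2\log(2p/\tau)}{n}}\; Q\Big(\max_{1\le j\le p}\sqrt{\mathbb{E}_n[x_{ij}^4\epsilon_i^4]},\, 1-\tau\Big) \vee 2\max_{1\le j\le p}\sqrt{\frac{2\mathrm{E}[x_{ij}^4\epsilon_i^4]}{n}}. \]

Proof. Let $Z_i = (x_i', \epsilon_i)'$, $f_j(Z_i) = x_{ij}^2\epsilon_i^2$, $\mathcal{F} = \{f_1, \ldots, f_p\}$, so that $n^{-1/2}\mathbb{G}_n(f_j(Z_i)) = \mathbb{E}_n[x_{ij}^2\epsilon_i^2] -
\mathrm{E}[x_{ij}^2\epsilon_i^2]$. Also, for $\tau_1 \in (0, 1/2)$ and $\tau_2 \in (0, 1)$, let

\[ e_{1n} = \sqrt{2\log(2p/\tau_1)}\; Q\Big(\max_{1\le j\le p}\sqrt{\mathbb{E}_n[x_{ij}^4\epsilon_i^4]},\, 1-\tau_2\Big) \quad \text{and} \quad e_{2n} = \max_{1\le j\le p} Q\big(|\mathbb{G}_n(x_{ij}^2\epsilon_i^2)|,\, 1/2\big), \]

where we have $e_{2n} \le \max_{1\le j\le p}\sqrt{2\mathrm{E}[x_{ij}^4\epsilon_i^4]}$ by Chebyshev.
By Lemma 5 we have

\[ P\Big(\max_{1\le j\le p} |\mathbb{E}_n[x_{ij}^2\epsilon_i^2] - \mathrm{E}[x_{ij}^2\epsilon_i^2]| > \frac{4e_{1n} \vee 2e_{2n}}{\sqrt{n}}\Big) \le 4\tau_1 + 4\tau_2. \]

The result follows by setting $\tau_1 = \tau_2 = \tau < 1/2$. Note that for $\tau \ge 1/2$ the result is trivial. □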
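The Chebyshev step used in the proof of Lemma 6 to bound the median term $e_{2n}$ can be written out as below. This is a sketch under the simplifying assumption that the expectations do not vary with $i$; with non-identical distributions the same argument would go through with averaged expectations.

```latex
% f_j(Z_i) = x_{ij}^2 \epsilon_i^2, and G_n(f_j) is a centered, normalized sum
% of independent terms, so
\operatorname{Var}\big(\mathbb{G}_n(f_j(Z_i))\big)
  = \frac{1}{n}\sum_{i=1}^n \operatorname{Var}\big(x_{ij}^2\epsilon_i^2\big)
  \;\le\; \mathrm{E}\big[x_{ij}^4\epsilon_i^4\big].

% Chebyshev's inequality at t = \sqrt{2\,\mathrm{E}[x_{ij}^4\epsilon_i^4]}:
P\big\{|\mathbb{G}_n(f_j(Z_i))| > t\big\}
  \;\le\; \frac{\mathrm{E}[x_{ij}^4\epsilon_i^4]}{t^2} \;=\; \frac{1}{2}
\quad\Longrightarrow\quad
Q\big(|\mathbb{G}_n(f_j(Z_i))|,\, 1/2\big) \;\le\; \sqrt{2\,\mathrm{E}\big[x_{ij}^4\epsilon_i^4\big]} .
```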



G.3. Moment Inequality. We shall be using the following result, which is based on the Markov
inequality and von Bahr and Esseen (1965).

Lemma 7 (von Bahr-Esseen's LLN). Let $r \in [1, 2]$, and consider independent zero-mean random
variables $X_i$ with $\mathrm{E}[|X_i|^r] \le C$. Then for any $\ell_n > 0$

\[ \Pr\Big(\frac{|\sum_{i=1}^n X_i|}{n} > \ell_n n^{-(1-1/r)}\Big) \le \frac{2C}{\ell_n^r}. \]
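To make the tail bound in Lemma 7 concrete, the sketch below compares a Monte Carlo estimate of the left-hand side with the bound $2C/\ell_n^r$. The choice of uniform draws on $[-1, 1]$ (for which $\mathrm{E}|X_i|^r = 1/(r+1)$), the function name `von_bahr_esseen_check`, and the default parameters are illustrative assumptions, not part of the paper.

```python
import random


def von_bahr_esseen_check(n=200, r=1.5, ell=1.0, reps=2000, seed=0):
    """Compare simulated P(|mean X_i| > ell * n^{-(1-1/r)}) with 2C / ell^r."""
    rng = random.Random(seed)
    # X_i ~ Uniform[-1, 1] has mean zero and E|X_i|^r = 1 / (r + 1).
    c = 1.0 / (r + 1.0)
    threshold = ell * n ** (-(1.0 - 1.0 / r))
    exceed = 0
    for _ in range(reps):
        mean = sum(rng.uniform(-1.0, 1.0) for _ in range(n)) / n
        if abs(mean) > threshold:
            exceed += 1
    bound = 2.0 * c / ell ** r
    return exceed / reps, bound


if __name__ == "__main__":
    freq, bound = von_bahr_esseen_check()
    print(f"simulated frequency {freq:.4f} <= bound {bound:.4f}")
```

For bounded variables such as these the bound is far from tight; it is most useful when only an $r$-th moment with $r < 2$ is available.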



G.4. Matrix Deviation Bounds. In this section we collect matrix deviation bounds.
We begin with a bound due to Rudelson (1999) for the case that $p < n$.

Lemma 8 (Essentially in Rudelson (1999)). Let $x_i$, $i = 1, \ldots, n$, be independent random
vectors in $\mathbb{R}^p$ and set

\[ \delta_n := C\sqrt{\frac{\log(n \wedge p)}{n}}\sqrt{\mathrm{E}\Big[\max_{1\le i\le n}\|x_i\|^2\Big]} \]

for some universal constant $C$. Then, we have

\[ \mathrm{E}\Big[\sup_{\|\alpha\|=1} \big|\mathbb{E}_n\big[(\alpha'x_i)^2 - \mathrm{E}[(\alpha'x_i)^2]\big]\big|\Big] \le \delta_n^2 + \delta_n\sup_{\|\alpha\|=1}\sqrt{\mathrm{E}[(\alpha'x_i)^2]}. \]



Based on results in Rudelson and Vershynin (2008), the following lemma for bounded
regressors was derived in the supplementary material of Belloni and Chernozhukov (2011b).

Lemma 9 (Essentially in Theorem 3.6 of Rudelson and Vershynin (2008)). Let $x_i$, $i = 1, \ldots, n$,
be independent random vectors in $\mathbb{R}^p$ such that $\sqrt{\mathrm{E}[\max_{1\le i\le n}\|x_i\|_\infty^2]} \le K$. Let $\delta_n :=
2\big(CK\sqrt{k}\log(1+k)\sqrt{\log(p \vee n)}\sqrt{\log n}\big)/\sqrt{n}$, where $C$ is a universal constant. Then,

\[ \mathrm{E}\Big[\sup_{\|\alpha\|_0\le k,\ \|\alpha\|=1} \big|\mathbb{E}_n\big[(\alpha'x_i)^2 - \mathrm{E}[(\alpha'x_i)^2]\big]\big|\Big] \le \delta_n^2 + \delta_n\sup_{\|\alpha\|_0\le k,\ \|\alpha\|=1}\sqrt{\mathrm{E}[(\alpha'x_i)^2]}. \]
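Proof. Let

\[ V_k = \sup_{\|\alpha\|_0\le k,\ \|\alpha\|=1} \big|\mathbb{E}_n\big[(\alpha'x_i)^2 - \mathrm{E}[(\alpha'x_i)^2]\big]\big|. \]

Then, by a standard symmetrization argument (Guédon and Rudelson (2007), page 804),

\[ n\mathrm{E}[V_k] \le 2\mathrm{E}_x\mathrm{E}_\varepsilon\Big[\sup_{\|\alpha\|_0\le k,\ \|\alpha\|=1}\Big|\sum_{i=1}^n \varepsilon_i(\alpha'x_i)^2\Big|\Big]. \]

Letting

\[ \hat{\phi}(k) = \sup_{\|\alpha\|_0\le k,\ \|\alpha\|=1} \mathbb{E}_n[(\alpha'x_i)^2] \quad \text{and} \quad \phi(k) = \sup_{\|\alpha\|_0\le k,\ \|\alpha\|=1} \mathrm{E}[(\alpha'x_i)^2], \]

we have $\hat{\phi}(k) \le \phi(k) + V_k$ and, by Lemma 3.8 in Rudelson and Vershynin (2008) to bound the
expectation in $\varepsilon$,

\[
\begin{aligned}
n\mathrm{E}[V_k] &\le 2\big(C\sqrt{k}\log(1+k)\sqrt{\log(p \vee n)}\sqrt{\log n}\big)\sqrt{n}\; \mathrm{E}_x\Big[\max_{i\le n}\|x_i\|_\infty\sqrt{\hat{\phi}(k)}\Big] \\
&\le 2\big(C\sqrt{k}\log(1+k)\sqrt{\log(p \vee n)}\sqrt{\log n}\big)\sqrt{n}\sqrt{\mathrm{E}_x\big[\max_{i\le n}\|x_i\|_\infty^2\big]\,\mathrm{E}_x[\hat{\phi}(k)]}.
\end{aligned}
\]

The result follows by noting that for positive numbers $v$, $A$, $B$, $v \le A(v + B)^{1/2}$ implies $v \le
A^2 + A\sqrt{B}$. □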
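The elementary implication invoked at the end of the proof of Lemma 9 can be verified directly. The derivation below is a worked check; reading $v$ as $\mathrm{E}[V_k]$, $A$ as $\delta_n$, and $B$ as the population supremum of $\mathrm{E}[(\alpha'x_i)^2]$ is our interpretation of how the proof applies it.

```latex
% Claim: for positive v, A, B,  v <= A (v + B)^{1/2}  implies  v <= A^2 + A sqrt(B).
v \le A\sqrt{v + B}
\;\Longrightarrow\; v^2 - A^2 v - A^2 B \le 0
\;\Longrightarrow\; v \le \frac{A^2 + \sqrt{A^4 + 4A^2B}}{2}.

% Since sqrt(x + y) <= sqrt(x) + sqrt(y) for x, y >= 0:
\sqrt{A^4 + 4A^2B} \le A^2 + 2A\sqrt{B}
\;\Longrightarrow\;
v \le \frac{A^2 + A^2 + 2A\sqrt{B}}{2} = A^2 + A\sqrt{B}.
```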

The following result establishes an approximation bound for sub-Gaussian regressors and
was developed in Rudelson and Zhou (2011). Recall that a random vector $Z \in \mathbb{R}^p$ is isotropic
if $\mathrm{E}[ZZ'] = I$, and it is called $\psi_2$ with a constant $\alpha$ if for every $w \in \mathbb{R}^p$ we have

\[ \|Z'w\|_{\psi_2} := \inf\{t : \mathrm{E}[\exp((Z'w)^2/t^2)] \le 2\} \le \alpha\|w\|_2. \]

Lemma 10 (Essentially in Theorem 3.2 of Rudelson and Zhou (2011)). Let $\Psi_i$, $i = 1, \ldots, n$,
be i.i.d. isotropic random vectors in $\mathbb{R}^p$ that are $\psi_2$ with a constant $\alpha$. Let $x_i = \Sigma^{1/2}\Psi_i$ so that
$\Sigma = \mathrm{E}[x_i x_i']$. For $m \le p$ and $\tau \in (0, 1)$ assume that

\[ n \ge \frac{80m\alpha^4}{\tau^2}\log\Big(\frac{12ep}{m\tau}\Big). \]

Then with probability at least $1 - 2\exp(-\tau^2 n/80\alpha^4)$, for all $u \in \mathbb{R}^p$, $\|u\|_0 \le m$, we have

\[ (1 - \tau)\|\Sigma^{1/2}u\|_2 \le \sqrt{\mathbb{E}_n[(x_i'u)^2]} \le (1 + \tau)\|\Sigma^{1/2}u\|_2. \]

For example, Lemma 10 covers the case of $x_i \sim N(0, \Sigma)$ by setting $\Psi_i \sim N(0, I)$, which is
isotropic and $\psi_2$ with a constant $\alpha = \sqrt{8/3}$.
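The constant $\sqrt{8/3}$ in the Gaussian example can be checked numerically: for a standard normal $Z$, the smallest $t$ with $\mathrm{E}[\exp(Z^2/t^2)] \le 2$ solves $(1 - 2/t^2)^{-1/2} = 2$. The sketch below finds this $t$ by bisection on a numerically integrated expectation. The helper names, the integration range, and the step counts are assumptions chosen for illustration.

```python
import math


def gaussian_exp_moment(t, half_width=60.0, steps=6000):
    """E[exp(Z^2 / t^2)] for Z ~ N(0, 1), via composite Simpson's rule."""
    a = 0.5 - 1.0 / (t * t)  # integrand is exp(-a z^2) / sqrt(2 pi)
    if a <= 0.0:
        return math.inf  # the expectation diverges when t^2 <= 2
    h = 2.0 * half_width / steps
    total = 0.0
    for k in range(steps + 1):
        z = -half_width + k * h
        weight = 1.0 if k in (0, steps) else (4.0 if k % 2 else 2.0)
        total += weight * math.exp(-a * z * z)
    return total * h / 3.0 / math.sqrt(2.0 * math.pi)


def gaussian_psi2_constant(lo=1.5, hi=3.0, iters=40):
    """Smallest t with E[exp(Z^2 / t^2)] <= 2, found by bisection."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if gaussian_exp_moment(mid) > 2.0:
            lo = mid  # moment too large: t must increase
        else:
            hi = mid
    return hi


if __name__ == "__main__":
    print(gaussian_psi2_constant(), math.sqrt(8.0 / 3.0))
```

Because $\Psi_i'w \sim N(0, \|w\|_2^2)$ when $\Psi_i \sim N(0, I)$, the one-dimensional computation is enough to recover the constant for every direction $w$.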
Table 1. Simulation Results for Selected R^2 Values

                            First Stage R^2 = .2   First Stage R^2 = .2   First Stage R^2 = .8   First Stage R^2 = .8
                            Structure R^2 = 0      Structure R^2 = .8     Structure R^2 = 0      Structure R^2 = .8
Estimation Procedure        RMSE    Rej. Rate      RMSE    Rej. Rate      RMSE    Rej. Rate      RMSE    Rej. Rate

A. Design 1. Quadratic Decay
Oracle                      0.090   0.048          0.090   0.048          0.045   0.057          0.045   0.057
Double-Selection Oracle     0.102   0.050          0.102   0.050          0.143   0.047          0.143   0.047
Post-Lasso                  0.137   0.205          0.110   0.064          0.402   0.987          0.489   0.974
Double-Selection            0.107   0.063          0.107   0.058          0.109   0.074          0.104   0.062
Double-Selection + Ridge    0.260   0.064          0.256   0.055          0.132   0.049          0.130   0.050

B. Design 2. Quadratic Decay with Heteroscedasticity
Oracle                      0.139   0.060          0.139   0.060          0.066   0.062          0.066   0.062
Double-Selection Oracle     0.169   0.072          0.169   0.072          0.225   0.085          0.225   0.085
Post-Lasso                  0.175   0.139          0.178   0.097          0.409   0.994          0.501   0.993
Double-Selection            0.165   0.098          0.167   0.081          0.162   0.082          0.165   0.083
Double-Selection + Ridge    0.308   0.060          0.290   0.058          0.183   0.064          0.185   0.075

C. Design 3. Quadratic Decay with Random Coefficients
Oracle                      0.070   0.055          0.070   0.055          0.041   0.060          0.041   0.060
Double-Selection Oracle     0.114   0.056          0.114   0.056          0.151   0.058          0.151   0.058
Post-Lasso                  0.105   0.082          0.131   0.133          0.329   0.940          0.435   0.953
Double-Selection            0.109   0.055          0.118   0.075          0.105   0.056          0.117   0.086
Double-Selection + Ridge    0.227   0.040          0.230   0.035          0.151   0.054          0.153   0.057



Note: The table reports root-mean-square-error (RMSE) and rejection rates for 5% level tests (Rej. Rate) from a Monte Carlo simulation
experiment. Results are based on 1000 simulation replications. Data in Panels A and B are based on models with coefficients that decay
quadratically, and the data in Panel C are based on a model with five quadratically decaying coefficients and 95 random coefficients. Further details
about the simulation models are provided in the text, as are details about the estimation procedures. Rejection rates are for t-tests of the null
hypothesis that the structural coefficient is equal to the true population value and are formed using jack-knife standard errors that are robust
to heteroscedasticity; see MacKinnon and White (1985).
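For readers who want to reproduce the two summary statistics reported in Table 1 from their own simulation output, a minimal sketch follows. It assumes each replication supplies a point estimate and a standard error; the function name `rmse_and_rejection_rate` and the use of the normal critical value 1.96 for a 5% two-sided test are assumptions made here, not details confirmed by the paper.

```python
import math


def rmse_and_rejection_rate(estimates, std_errors, true_value, critical_value=1.96):
    """Return (RMSE, rejection rate) across simulation replications.

    A replication counts as a rejection when the t-statistic for the null
    hypothesis "coefficient equals true_value" exceeds critical_value in
    absolute value.
    """
    n = len(estimates)
    rmse = math.sqrt(sum((e - true_value) ** 2 for e in estimates) / n)
    rejections = sum(
        1
        for e, s in zip(estimates, std_errors)
        if abs(e - true_value) / s > critical_value
    )
    return rmse, rejections / n


if __name__ == "__main__":
    # Toy inputs for illustration only; these are not values from the paper.
    est = [0.4, 0.5, 0.6, 0.9]
    se = [0.1, 0.1, 0.1, 0.1]
    print(rmse_and_rejection_rate(est, se, true_value=0.5))
```

A rejection rate close to 0.05 indicates that the nominal 5% test has approximately correct size in the simulated design.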
[Figure 2 panels: Post-Lasso RP(0.05), DS RP(0.05), Oracle DS RP(0.05); Post-Lasso Mean Bias, DS Mean Bias, Oracle DS Mean Bias. Axes: First Stage R^2, Second Stage R^2.]

Figure 2. This figure presents rejection frequencies for 5% level tests, biases,
and standard deviations for estimating the treatment effect from Design 1 of the
simulation study which has quadratically decaying coefficients and homoscedasticity.
Results are reported for a one-step Post-Lasso estimator, our proposed
double selection procedure, and the infeasible OLS estimator that uses the set of
variables that have coefficients larger than 0.1 in either equation (2.6) or (2.7).
Reduced form and first stage R^2 correspond to the population R^2 of (2.6) and
(2.7) respectively. Note that rejection frequencies are censored at 0.5.
[Figure 3 panels: Post-Lasso RP(0.05), DS RP(0.05), Oracle DS RP(0.05); Post-Lasso Mean Bias, DS Mean Bias, Oracle DS Mean Bias.]

Figure 3. This figure presents rejection frequencies for 5% level tests, biases,
and standard deviations for estimating the treatment effect from Design 2 of
the simulation study which has quadratically decaying coefficients and
heteroscedasticity. Results are reported for a one-step Post-Lasso estimator, our
proposed double selection procedure, and the infeasible OLS estimator that uses
the set of variables that have coefficients larger than 0.1 in either equation (2.6)
or (2.7). Reduced form and first stage R^2 correspond to the population R^2 of
(2.6) and (2.7) respectively. Note that rejection frequencies are censored at 0.5.
Figure 4. This figure presents rejection frequencies for 5% level tests, biases,
and standard deviations for estimating the treatment effect from Design 3 of
the simulation study which has five quadratically decaying coefficients and 95
Gaussian random coefficients. Results are reported for a one-step Post-Lasso
estimator, our proposed double selection procedure, and the infeasible OLS
estimator that uses the set of variables that have coefficients larger than 0.1
in either equation (2.6) or (2.7). Reduced form and first stage R^2 correspond
to what would be the population R^2 of (2.6) and (2.7) if all of the random
coefficients were equal to zero. Note that rejection frequencies are censored at
0.5.
[Figure 5 panels: DS RP(0.05), DS RP(0.05) (Ridge); DS RMSE, DS RMSE (Ridge).]

Figure 5. This figure presents rejection frequencies for 5% level tests and
RMSEs for estimating the treatment effect from Design 3 of the simulation
study which has five quadratically decaying coefficients and 95 Gaussian random
coefficients. Results in the first column are for the proposed double selection
procedure, and the results in the second column are for the proposed double
selection procedure when the ridge fit from (2.6) is added as an additional
potential control. Reduced form and first stage R^2 correspond to what would
be the population R^2 of (2.6) and (2.7) if all of the random coefficients were
equal to zero. Note that the vertical axis on the rejection frequency graph is
from 0 to 0.1.
Table 2. Estimated Effects of Abortion on Crime Rates

                                          Violent Crime         Property Crime        Murder
                                          Effect   Std. Err.    Effect   Std. Err.    Effect   Std. Err.

A. Donohue III and Levitt (2001) Table IV
Donohue III and Levitt (2001) Table IV    -0.129   0.024        -0.091   0.018        -0.121   0.047
First-Difference                          -0.152   0.034        -0.108   0.022        -0.204   0.068
All Controls                               0.294   0.472        -0.068   0.157         0.321   1.109
Post-Double-Selection                     -0.087   0.181        -0.094   0.051         0.006   0.280
Post-Double-Selection+                    -0.044   0.176        -0.093   0.051        -0.166   0.216



Note: The table displays the estimated coefficient on the abortion rate, "Effect," and its estimated standard error. Numbers in 
the first row are taken from Donohue III and Levitt (2001) Table IV, columns (2), (4), and (6). The remaining rows are estimated by 
first differences, include a full set of time dummies, and use standard errors clustered at the state-level. Estimates in the row 
labeled "First-Difference" are obtained using the same controls as in the first row. Estimates in the row labeled "All Controls" use 
251 control variables as discussed in the text. Estimates in the row "Post-Double-Selection" use the variable selection technique 
developed in this paper to search among the set of 251 potential controls. Estimates in the row "Post-Double-Selection+" use the 
variables selected by the procedure of this paper augmented with the set of variables from Donohue III and Levitt (2001). 
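To read the entries of Table 2 as interval estimates, one can form approximate 95% confidence intervals from each effect and its standard error. The sketch below assumes a normal approximation with critical value 1.96; the helper name `normal_interval` is ours, and the example inputs are the violent-crime and property-crime entries of the "Post-Double-Selection" row.

```python
def normal_interval(effect, std_err, z=1.96):
    """Approximate confidence interval: effect +/- z * std_err."""
    half_width = z * std_err
    return effect - half_width, effect + half_width


if __name__ == "__main__":
    # "Post-Double-Selection" row of Table 2.
    for label, effect, se in [
        ("Violent Crime", -0.087, 0.181),
        ("Property Crime", -0.094, 0.051),
    ]:
        low, high = normal_interval(effect, se)
        print(f"{label}: [{low:.3f}, {high:.3f}]")
```

Under this approximation the violent-crime interval includes zero, which is a direct consequence of the larger standard error reported for that row.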



